Loan Default Prediction¶

Executive Summary: Loan Default Prediction Project¶

This project aimed to develop a robust, interpretable classification model that predicts which home equity loan applicants are likely to default. By leveraging historical loan data and machine learning techniques, the goal was to build a data-driven system that helps the financial institution mitigate losses from non-performing assets, streamline the loan approval process, ensure regulatory compliance, and surface the key drivers of default. The work spanned the full cycle: understanding the problem context and data, thorough data preparation and exploratory analysis, building and evaluating predictive models, and finally proposing the most suitable solution with clear next steps for implementation and continuous improvement.

Important Findings from Analysis:¶

The project began with a thorough Exploratory Data Analysis (EDA) of the Home Equity dataset (HMEQ), which contains information on 5,960 home equity loans. Key findings from this initial analysis highlighted critical factors influencing loan default:

  • Data Overview: The dataset comprised 13 features, including loan details, property information, credit history, and job details, with 'BAD' as the binary target variable indicating loan default. Initial inspection revealed the presence of missing values across several features and a significant class imbalance, with only about 20% of loans resulting in default. This class imbalance was a primary driver for employing techniques to ensure the model could effectively learn from the minority class.
  • Key Risk Indicators: EDA and subsequent model analysis consistently identified Debt-to-Income Ratio (DEBTINC), Number of Delinquent Credit Lines (DELINQ), and Number of Major Derogatory Reports (DEROG) as the most significant predictors of loan default. Higher values in these features were strongly associated with increased default risk across different analytical approaches and models. These features represent key components of an applicant's financial health and credit history, directly driving the likelihood of repayment or default.
  • Credit History Impact: Features related to credit history, such as the Age of the Oldest Credit Line (CLAGE) and Number of Recent Credit Inquiries (NINQ), also played an important role. A younger credit history and a higher number of recent inquiries were associated with increased risk. These factors provide crucial context about an applicant's past financial behavior and current credit-seeking activity, influencing their ability to manage new debt.
  • Property and Loan Value Insights: While not as strongly correlated as credit history features, the analysis suggested that lower values in Loan Amount (LOAN), Amount Due on Existing Mortgage (MORTDUE), and Current Value of the Property (VALUE) could be associated with higher default risk, particularly in conjunction with other risk factors. This highlights that the loan-to-value ratio and the overall financial leverage represented by the loan size relative to property value are important considerations.
  • Job Type and Loan Reason: EDA revealed that default rates varied across different job types and the reason for the loan, with 'Sales' and 'Self' job categories and 'HomeImp' loans showing slightly higher default rates. While having lower feature importance in some models, these categorical factors provide additional context about an applicant's financial stability and motivation for seeking a loan.
  • Data Quality: The dataset contained missing values and outliers, which were addressed through imputation and capping to prepare the data for modeling. Handling these data quality issues was essential to ensure the reliability of the subsequent analysis and model training.
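The EDA checks above can be sketched in a few lines of pandas. The mini DataFrame below is purely hypothetical stand-in data mimicking the HMEQ schema (the real dataset has 5,960 rows); it only illustrates the kinds of summaries used: overall default rate, missing-value counts, and default rate by job category.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample mimicking the HMEQ schema (NOT the real data)
df = pd.DataFrame({
    "BAD":     [0, 1, 0, 0, 1, 0, 0, 0, 1, 0],
    "DEBTINC": [30.1, np.nan, 25.4, 33.0, 45.2, np.nan, 28.7, 31.9, 50.3, 27.5],
    "JOB":     ["Office", "Sales", "Mgr", "Office", "Self",
                "Other", "Mgr", "Office", "Sales", "Other"],
})

# Overall default rate -- in the full HMEQ data this is roughly 20%
default_rate = df["BAD"].mean()

# Missing-value counts per column: the first data-quality check
missing = df.isna().sum()

# Default rate by job category, mirroring the EDA on the JOB feature
rate_by_job = df.groupby("JOB")["BAD"].mean()

print(default_rate)
print(missing)
print(rate_by_job)
```

The same `groupby` pattern applies to REASON or any other categorical feature.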

Data Preparation and Handling:¶

Before model building, a series of data preparation steps were undertaken to transform the raw data into a format suitable for machine learning algorithms and address identified issues:

  • Outlier Treatment: Outliers in numerical features were addressed by capping extreme values using either the IQR method or the 95th percentile, depending on the distribution and nature of the feature. This aimed to reduce the influence of extreme values on model performance without removing valuable data points.
  • Missing Value Imputation: Missing values in categorical columns (REASON, JOB) were imputed with the mode, while missing values in numerical columns were imputed with the median. This strategy aimed to maintain the distribution of the data while handling missing information effectively.
  • Categorical Feature Encoding: Categorical variables were transformed into a numerical format suitable for modeling. The 'JOB' column was one-hot encoded to represent each category as a binary feature, and the 'REASON' column was label encoded based on the observed relationship with the target variable.
  • Multicollinearity Check: The Variance Inflation Factor (VIF) was calculated for the independent variables to check for multicollinearity, which can affect model stability. All VIF values were below 5, indicating that multicollinearity was not a significant issue among the chosen features.
  • Data Splitting: The prepared data was split into training (70%) and testing (30%) sets using stratified sampling. This ensured that the proportion of defaulted and non-defaulted loans was maintained in both sets, which is essential due to the class imbalance and allows for reliable evaluation on unseen data.
  • Addressing Class Imbalance (SMOTE): To further address the class imbalance in the training data and improve the model's ability to learn from the minority class, the Synthetic Minority Over-sampling Technique (SMOTE) was applied to the training set. This created synthetic instances of the minority class (defaulters), resulting in a more balanced training dataset for certain models, particularly the tree-based ones.

Model Building, Evaluation, and Comparison:¶

Three classification models were built and evaluated for their ability to predict loan default: Logistic Regression, Decision Tree, and Random Forest. The models were evaluated using key metrics, with a focus on Recall and F1-score for the defaulted class to prioritize minimizing false negatives, which represent a higher business cost.

  • Logistic Regression: An initial Logistic Regression model was built and evaluated. To address class imbalance, class weights were adjusted, and the model was also trained on SMOTE-resampled data. The threshold for classification was adjusted using the precision-recall curve to optimize for the F1-score, balancing precision and recall.
  • Decision Tree: A Decision Tree Classifier was built. Hyperparameter tuning using GridSearchCV was performed to mitigate overfitting observed in the initial model and improve performance on unseen data. A Decision Tree model was also trained on the SMOTE-resampled data to assess the impact of balancing the training set.
  • Random Forest: A Random Forest Classifier, an ensemble of Decision Trees, was built. Hyperparameter tuning was performed to optimize its performance. A Random Forest model was also trained on the SMOTE-resampled data to leverage the benefits of both ensembling and handling class imbalance.

Model Performance: Several classification algorithms (Logistic Regression, Decision Tree, Random Forest) were evaluated. The analysis demonstrated that addressing class imbalance, particularly through SMOTE (Synthetic Minority Over-sampling Technique), was essential for improving the models' ability to identify the minority class (defaulters).

Final Proposed Model Specifications:¶

  • Model Type: Random Forest Classifier trained with SMOTE: Random Forest is chosen for its capability to handle large datasets and its robustness against overfitting. It aggregates the predictions of many decision trees, improving predictive accuracy over any single tree.
  • Class Imbalance Handling: Trained using SMOTE (Synthetic Minority Over-sampling Technique) on the training data to balance the class distribution. SMOTE works by creating synthetic samples of the minority class by interpolating between existing minority instances and their nearest neighbors, rather than simply duplicating data. This helps the model learn the characteristics of the minority class more effectively without overfitting.
  • Hyperparameter Tuning: Hyperparameter tuning was explored for the Random Forest, but the untuned model trained with SMOTE already demonstrated strong performance, and the metrics presented here reflect that untuned model. Given these results and the computational cost of extensive tuning with cross-validation on a resampled dataset, further tuning was deemed a potential next step rather than a necessary component of the immediate proposed solution.
  • Performance on Original Test Data: The model achieved the following key metrics on the unseen, original test data (which reflects the real-world imbalanced distribution):
    • Recall (Default - Class 1): 0.92 (Successfully identified 92% of actual defaulters)
    • Precision (Default - Class 1): 0.86 (When predicting default, was correct 86% of the time)
    • F1-score (Default - Class 1): 0.89 (Best balance between Recall and Precision for the minority class)
    • Accuracy: 0.95
  • Key Predictors (Post-Modeling): Feature importance analysis for this model confirmed that DELINQ, DEBTINC, and DEROG were the most influential features in predicting loan default, followed by NINQ and CLAGE.
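The "Key Predictors" finding comes from the Random Forest's impurity-based feature importances. The sketch below shows the mechanics on synthetic data; the HMEQ-style column names are a hypothetical mapping for illustration, so the resulting ranking will not match the project's actual one.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; column names borrowed from HMEQ purely for illustration
X, y = make_classification(n_samples=500, n_features=5,
                           n_informative=3, random_state=0)
cols = ["DELINQ", "DEBTINC", "DEROG", "NINQ", "CLAGE"]  # hypothetical mapping

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sorting gives the predictor ranking
importances = pd.Series(rf.feature_importances_, index=cols).sort_values(
    ascending=False
)
print(importances)
```

For the final model, these importances (optionally cross-checked with permutation importance) underpin the DELINQ / DEBTINC / DEROG ranking reported above.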

Key Next Steps:¶

To implement the developed loan default prediction solution effectively and get the most value from it, the following steps are recommended for the stakeholders:

  1. Operationalization and Deployment:
    • Stakeholder Action: The financial institution's IT and risk management teams should collaborate to integrate the Random Forest model trained with SMOTE into the existing loan application processing system.
    • Making the Best of the Solution: Establish a clear workflow for how model predictions will be used in the loan approval process. This could involve using the model's risk score as a key factor in automated approvals for low-risk applicants and flagging high-risk applicants for manual review by loan officers.
  2. Continuous Monitoring and Performance Evaluation:
    • Stakeholder Action: The risk management and analytics teams must establish a system to continuously monitor the model's performance in the production environment.
    • Improving the Solution: Track key metrics (Recall, Precision, F1-score, False Positive Rate, False Negative Rate) on an ongoing basis using new incoming loan data. Implement alerts for significant drops in performance that might indicate concept drift (changes in data patterns over time).
  3. Refinement of Decision Thresholds:
    • Stakeholder Action: Risk management and business leaders need to define the optimal prediction threshold based on the financial costs associated with false positives (lost revenue from rejected good loans) and false negatives (losses from defaulted loans).
    • Making the Best of the Solution: Adjust the model's prediction threshold based on this cost-benefit analysis to maximize the overall value to the institution.
  4. Enhanced Interpretability and Explainability:
    • Stakeholder Action: The analytics team should develop and implement tools (e.g., SHAP values, LIME) to generate explanations for the model's predictions, especially for loan rejections.
    • Making the Best of the Solution: Provide these explanations to loan officers and applicants (where legally required and appropriate) to build trust, ensure fairness, and meet regulatory requirements.
  5. Regular Model Retraining and Updating:
    • Stakeholder Action: The analytics team should establish a schedule for regularly retraining the model using the most recent loan data.
    • Improving the Solution: This is crucial to ensure the model remains accurate and relevant as economic conditions and applicant behaviors change over time.
  6. Feature Engineering and External Data Exploration:
    • Stakeholder Action: The analytics team can explore creating new features from existing data or incorporating external data sources (e.g., macroeconomic indicators) that might further improve the model's predictive power.
    • Improving the Solution: This ongoing research and development can lead to a more robust and accurate model over time.
  7. Stakeholder Training and Collaboration:
    • Stakeholder Action: Provide training to loan officers and relevant staff on how to interpret the model's predictions and explanations. Foster a collaborative environment where feedback from the business can be used to refine the model and its implementation.

By following these steps, the financial institution can effectively leverage the developed loan default prediction model to reduce risk, improve efficiency, and make more informed lending decisions, ultimately contributing to financial stability and growth.

Problem and Solution Summary¶

Summary of the Problem¶

The core problem addressed in this exercise is predicting loan defaults in a dataset of home equity loans. This is a critical task for financial institutions to mitigate financial losses from non-performing loans, improve risk assessment accuracy, and streamline the loan approval process. The current manual review process of loan applications is effort-intensive and prone to errors and biases, leading to potential financial losses from missed defaulters and inefficient processing. Therefore, the bank requires a model that can effectively automate credit scoring while remaining free from these biases and improving the accuracy of default prediction. A key challenge identified was the significant class imbalance in the dataset, with a much smaller number of defaulted loans compared to repaid loans. Additionally, the data contained missing values and outliers, requiring careful preprocessing.

Key Points Describing the Final Proposed Solution Design¶

The final proposed solution design is based on a data-driven predictive model using a Random Forest Classifier trained with SMOTE (Synthetic Minority Over-sampling Technique).

  1. Data-Driven Model: The solution will employ predictive modeling techniques to create a data-driven model based on the bank’s historical loan performance data.
  2. Classification Approach: A classification model, specifically the Random Forest Classifier trained with SMOTE, will be developed to predict the likelihood of loan defaults.
  3. Model Type: Random Forest Classifier, an ensemble learning method combining multiple decision trees for robust prediction. It was chosen for its capability to handle large datasets, its robustness against overfitting, and its ability to handle nonlinear relationships between variables, while still being interpretable enough to explain decisions on adverse outcomes.
  4. Feature Importance Analysis: The Random Forest model provides insights into feature importance, helping to identify which factors contribute most significantly to loan defaults.
  5. Interpretability: While Random Forest is an ensemble model, its feature importance and potential use with model-agnostic techniques (like SHAP/LIME) offer sufficient interpretability to support decisions and ensure compliance with regulations like the Equal Credit Opportunity Act.
  6. Class Imbalance Handling: SMOTE is applied to the training data to address the class imbalance by creating synthetic instances of the minority class (defaulters), enabling the model to learn effectively from both classes.
  7. Data Preprocessing: The solution incorporates essential preprocessing steps including imputation of missing values (median for numerical, mode for categorical), capping of outliers (IQR or 95th percentile) to handle extreme values, and encoding of categorical features (one-hot encoding and label encoding) for model compatibility.
  8. Key Predictive Features: The model leverages features strongly associated with default risk, most notably DELINQ (Number of Delinquent Credit Lines), DEBTINC (Debt-to-Income Ratio), and DEROG (Number of Major Derogatory Reports), as identified through comprehensive EDA and feature importance analysis.
  9. Performance Focus: The model is evaluated and selected based on its performance on the original, unseen test data, with a primary focus on achieving high Recall (minimizing false negatives) and a good F1-score for the defaulted class.

Reason for the Proposed Solution Design and Business Impact¶

The proposed solution is considered valid and likely to solve the problem due to its demonstrated effectiveness and anticipated positive impact on the business:

  • Reason for the Proposed Solution Design:

    • Superior Performance: The Random Forest Classifier trained with SMOTE was chosen due to its superior performance on the key evaluation metrics for the defaulted class (Class 1) on the original, unseen test data. It achieved the highest F1-score (0.89), demonstrating the best balance between correctly identifying actual defaulters (Recall: 0.92) and minimizing incorrect default predictions (Precision: 0.86).
    • Efficiency and Automation: The Random Forest model allows for the automation of the credit scoring process, significantly reducing the manual effort required and enabling the bank to process a higher volume of applications more swiftly.
    • Reducing Bias: A Random Forest model trained on a well-preprocessed and balanced dataset can provide a more objective and consistent decision-making framework compared to potentially subjective manual reviews, helping to mitigate human errors and biases.
    • Regulatory Compliance: While ensemble models like Random Forests are less inherently interpretable than single Decision Trees, the ability to extract feature importance and potentially use model-agnostic interpretability techniques (like SHAP or LIME) allows the bank to provide justification for loan rejections, which is crucial for maintaining compliance with regulations.
    • Comparing to Other Evaluated Models:
    • Vs. Tuned Logistic Regression (Recall: 0.79, Precision: 0.33, F1: 0.46): The Random Forest with SMOTE shows significantly higher Recall, much higher Precision, and a substantially better F1-score. While Tuned LR caught a good proportion of defaulters, its very low Precision would result in a high rate of false alarms.
    • Vs. Tuned Decision Tree (Recall: 0.76, Precision: 0.58, F1: 0.66): The Random Forest with SMOTE has a higher Recall (0.92 vs 0.76), higher Precision (0.86 vs 0.58), and a better F1-score (0.89 vs 0.66). This shows it is better at catching defaulters and makes fewer false positive predictions.
    • Vs. Decision Tree with SMOTE (Recall: 0.91, Precision: 0.67, F1: 0.77): The Random Forest with SMOTE maintains a very high Recall (0.92 vs 0.91) while achieving significantly higher Precision (0.86 vs 0.67), leading to a notably higher F1-score (0.89 vs 0.77). This shows it's slightly better at identifying defaulters and considerably more accurate when it predicts default.
    • Vs. Random Forest without SMOTE (Untuned & Tuned): The Random Forest with SMOTE significantly outperforms both the untuned (Recall: 0.62, Precision: 0.83, F1: 0.71) and tuned (Recall: 0.76, Precision: 0.63, F1: 0.69) Random Forest models without SMOTE, particularly in achieving a much higher Recall while maintaining strong Precision, resulting in a significantly better F1-score. This highlights the critical positive impact of using SMOTE for this model on imbalanced data.
    • Effective Class Imbalance Handling: The use of SMOTE was crucial for boosting the model's ability to handle the imbalanced data and predict the minority class effectively, significantly improving Recall compared to models without SMOTE.
    • Robustness: Random Forests, as ensemble models, are generally robust to noise and outliers and less prone to overfitting compared to single Decision Trees.
  • How it Would Affect the Problem/Business:

    • Mitigation of Financial Losses: By accurately identifying applicants with a high risk of default before approving loans (high Recall), the financial institution can avoid significant financial losses associated with non-performing assets.
    • Improved Risk Assessment: The model provides a more objective, data-driven, and consistent approach to assessing loan default risk compared to potentially subjective manual methods.
    • Increased Operational Efficiency: Automating the risk assessment process for a significant portion of loan applications can reduce the time and resources required for manual review, allowing for faster loan processing.
    • Enhanced Compliance and Fairness: A data-driven model can contribute to a more objective lending process, potentially reducing bias and supporting regulatory compliance requirements, especially when paired with explainability tools.
    • Valuable Business Insights: The feature importance analysis provides clear insights into the key drivers of loan default, which can inform better risk management strategies, loan product design, and targeted interventions.
    • Optimized Decision Making: The model's predictions can empower loan officers and decision-makers with better information to make more informed choices about loan approvals.
    • Cost Reduction: By reducing the instances of bad loans, the bank can protect its profit margins and reduce costs associated with NPAs (Non-Performing Assets).
    • Customer Satisfaction: Faster and fairer processing of loan applications can enhance customer satisfaction and potentially increase customer base.

This structured approach not only addresses the problem but also aligns with the bank’s needs to regulate loan disbursement judiciously while adhering to compliance mandates.

Recommendations for Implementation¶

Implementing the proposed Random Forest model trained with SMOTE requires careful planning and execution to ensure its effectiveness, integrate it into existing workflows, and maximize its benefits.

Key Recommendations to Implement the Solution:

  • Phased Rollout: Implement the model in a phased approach, perhaps starting with a pilot group of loan officers or a specific loan product, to monitor performance and gather feedback before a full-scale deployment.
  • Integration with Existing Systems: Seamlessly integrate the model's prediction engine into the bank's existing loan origination system. This ensures that the model's risk scores are available to decision-makers in a timely and accessible manner.
  • Automated Data Pipeline: Establish an automated data pipeline for collecting, cleaning, preprocessing (including handling missing values and outliers, and applying SMOTE for training data), and feeding new loan application data to the model for scoring.
  • Continuous Monitoring and Validation: Implement a robust system for continuous monitoring of the model's performance in production. This includes tracking key metrics (Recall, Precision, F1-score, AUC-ROC) over time and comparing them to baseline performance. Regular validation of the model's predictions against actual loan outcomes is crucial.
  • Threshold Optimization: Work closely with the risk management team to determine and periodically re-evaluate the optimal prediction probability threshold for classifying a loan as high risk. This threshold should be based on a thorough analysis of the business costs associated with false positives (rejecting good loans) and false negatives (approving defaulting loans).
  • Develop Explainability Tools and Protocols: Create tools (e.g., using SHAP or LIME) to generate explanations for the model's predictions, especially for denied loan applications. Establish clear protocols for how these explanations should be used and communicated to loan officers and applicants (as required by regulations).
  • Establish a Model Governance Framework: Implement a framework for managing the model throughout its lifecycle, including documentation, version control, validation, and retraining procedures.
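The threshold-optimization recommendation above can be made concrete with a simple cost sweep: score the test set, try a grid of probability thresholds, and pick the one minimizing total expected cost. The cost figures and data below are illustrative assumptions (a missed defaulter is assumed 10x as costly as a wrongly rejected good loan), not values from the source.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Assumed unit costs: false negative (missed defaulter) vs. false positive
# (rejected good loan). These numbers are hypothetical.
COST_FN, COST_FP = 10_000, 1_000

# Synthetic imbalanced data standing in for the prepared loan features
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=7
)

proba = (RandomForestClassifier(random_state=7)
         .fit(X_tr, y_tr)
         .predict_proba(X_te)[:, 1])

# Sweep candidate thresholds and total the misclassification cost at each
thresholds = np.linspace(0.05, 0.95, 19)
costs = []
for t in thresholds:
    pred = (proba >= t).astype(int)
    fn = ((pred == 0) & (y_te == 1)).sum()
    fp = ((pred == 1) & (y_te == 0)).sum()
    costs.append(fn * COST_FN + fp * COST_FP)

best = thresholds[int(np.argmin(costs))]
print(f"cost-minimizing threshold: {best:.2f}")
```

In production, the cost inputs would come from the risk management team's loss estimates, and the sweep would be re-run whenever those estimates or the loan mix change.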

Key Actionables for Stakeholders:

  • Risk Management:
    • Define the acceptable levels of risk (tolerance for false positives and false negatives).
    • Collaborate on setting and adjusting the prediction threshold.
    • Utilize the model's insights to refine lending policies and risk mitigation strategies.
    • Monitor model performance and provide feedback on its effectiveness in real-world scenarios.
  • Loan Officers:
    • Receive training on how to interpret the model's risk scores and explanations.
    • Understand how the model fits into the overall loan approval process.
    • Provide feedback on the model's predictions and challenging cases.
    • Use the model as a tool to support, not replace, their expert judgment, especially for borderline cases.
  • IT Department:
    • Integrate the model into the existing loan origination system.
    • Build and maintain the automated data pipeline.
    • Develop and support the monitoring and explainability tools.
    • Ensure the security and scalability of the model deployment.
  • Analytics/Data Science Team:
    • Deploy and maintain the model in production.
    • Continuously monitor model performance and identify potential issues.
    • Conduct regular retraining of the model with new data.
    • Explore further model enhancements and address associated problems.
    • Develop and support the explainability tools.

Expected Benefits and Potential Costs:

  • Expected Benefits:
    • Reduced Financial Losses: By more accurately identifying potential defaulters (high Recall), the bank can significantly reduce losses from non-performing loans.
    • Increased Efficiency: Automating the initial risk assessment can speed up the loan approval process, reduce manual workload, and allow loan officers to focus on more complex cases.
    • Improved Risk Consistency: The model provides a consistent and objective risk assessment across all applicants, reducing variability compared to manual reviews.
    • Enhanced Regulatory Compliance: The data-driven and potentially explainable approach supports compliance with fair lending regulations.
    • Valuable Business Insights: The feature importance analysis provides actionable insights into the key drivers of default, informing better business strategies.
    • Potential for Increased Loan Volume: Faster processing may enable the bank to handle a higher volume of loan applications.
  • Potential Costs (Rational Assumptions):
    • Development and Implementation Costs: Initial costs for model development, data pipeline setup, system integration, and infrastructure (e.g., cloud resources). Assume a range based on complexity, e.g., $50,000 to $200,000+.
    • Maintenance and Monitoring Costs: Ongoing costs for monitoring infrastructure, model retraining, and model updates. Assume a monthly cost, e.g., $2,000 to $10,000+.
    • Opportunity Cost of False Positives: Lost revenue from incorrectly rejecting loans from potentially good applicants. This cost is harder to quantify but needs to be considered in threshold optimization.
    • Training Costs: Costs associated with training loan officers and staff on using the new system. Assume a one-time cost per person or per training session.

Key Risks and Challenges:

  • Data Quality Issues: Ongoing data quality issues in the incoming loan application data could negatively impact model performance.
  • Concept Drift: The relationship between input features and loan default may change over time due to economic shifts, changes in lending practices, or applicant demographics, leading to model performance degradation.
  • Model Interpretability: While the Random Forest provides feature importance, explaining individual predictions for complex cases might still be challenging compared to simpler models like Logistic Regression or a limited-depth Decision Tree.
  • Stakeholder Adoption and Trust: Gaining trust and buy-in from loan officers and other stakeholders is crucial. Resistance to using an automated system or lack of understanding of its outputs can hinder successful implementation.
  • Regulatory Scrutiny: Predictive models used in lending are subject to regulatory review. Ensuring fairness, transparency, and the ability to explain decisions is paramount.
  • System Integration Complexity: Integrating the model seamlessly into existing legacy systems can be technically challenging.

What Further Analysis Needs to be Done or What Other Associated Problems Need to Be Solved:

  • Advanced Hyperparameter Tuning: Conduct more extensive hyperparameter tuning for the Random Forest with SMOTE using methods like Bayesian Optimization or more iterations with RandomizedSearchCV to potentially find better performing configurations.
  • Feature Engineering: Explore creating new features from existing ones (e.g., ratios, interaction terms) or incorporating external data sources (e.g., macroeconomic indicators, credit bureau data if available) to potentially improve predictive power.
  • Exploration of Other Algorithms: Evaluate other algorithms well-suited for imbalanced classification, such as Gradient Boosting models (XGBoost, LightGBM, CatBoost) or ensemble methods like BalancedBaggingClassifier.
  • Alternative Imputation Strategies: Investigate more sophisticated imputation techniques (e.g., K-Nearest Neighbors imputation, MICE) and assess their impact on model performance.
  • Cost-Sensitive Learning: Explore cost-sensitive learning approaches that directly incorporate the different costs of false positives and false negatives into the model training process.
  • Fairness and Bias Analysis: Conduct a thorough analysis of model fairness across different demographic groups to identify and mitigate potential biases, ensuring compliance with fair lending regulations.
  • Robustness to Data Distribution Shifts: Develop strategies to detect and handle shifts in the distribution of input features over time (covariate shift) to maintain model performance.
  • Scalability of the Solution: Ensure the proposed solution can handle the anticipated volume of loan applications and scale as the bank's business grows.
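Among the follow-ups listed, cost-sensitive learning is easy to prototype: scikit-learn's `class_weight` parameter reweights training errors so mistakes on defaulters count more. The sketch below compares an unweighted forest with one using a hypothetical 10:1 weighting on synthetic data; the ratio and data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# Synthetic imbalanced data (~20% positives)
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=3
)

# Baseline forest vs. one where errors on class 1 (defaulters) are
# weighted 10x -- a crude proxy for the higher cost of a missed default
plain = RandomForestClassifier(random_state=3).fit(X_tr, y_tr)
weighted = RandomForestClassifier(
    class_weight={0: 1, 1: 10}, random_state=3
).fit(X_tr, y_tr)

r_plain = recall_score(y_te, plain.predict(X_te))
r_weighted = recall_score(y_te, weighted.predict(X_te))
print(f"recall plain={r_plain:.2f} weighted={r_weighted:.2f}")
```

A fuller cost-sensitive approach would set the weights from the actual loss estimates used in threshold optimization, so training and decisioning share one cost model.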

Data Description:¶

The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates whether an applicant has ultimately defaulted or has been severely delinquent. This adverse outcome occurred in 1,189 cases (20 percent). 12 input variables were registered for each applicant.

  • BAD: 1 = Client defaulted on loan, 0 = loan repaid

  • LOAN: Amount of loan approved.

  • MORTDUE: Amount due on the existing mortgage.

  • VALUE: Current value of the property.

  • REASON: Reason for the loan request (HomeImp = home improvement; DebtCon = debt consolidation, i.e., taking out a new loan to pay off other liabilities and consumer debts).

  • JOB: The type of job that loan applicant has such as manager, self, etc.

  • YOJ: Years at present job.

  • DEROG: Number of major derogatory reports (which indicates a serious delinquency or late payments).

  • DELINQ: Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due).

  • CLAGE: Age of the oldest credit line in months.

  • NINQ: Number of recent credit inquiries.

  • CLNO: Number of existing credit lines.

  • DEBTINC: Debt-to-income ratio (all monthly debt payments divided by gross monthly income; one way lenders measure an applicant's ability to manage the monthly payments on the money they plan to borrow).
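As a quick illustration of how the debt-to-income ratio is computed (a hypothetical applicant, not a row from HMEQ):

```python
# Hypothetical applicant: $1,700 in monthly debt payments, $5,000 gross monthly income
monthly_debt = 1700
monthly_income = 5000

# Expressed as a percentage, matching the scale of the DEBTINC column
debtinc = 100 * monthly_debt / monthly_income
print(f"DEBTINC = {debtinc:.2f}")  # → DEBTINC = 34.00
```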

Import the necessary libraries and Data¶

In [11]:
# import libraries for data manipulation
import numpy as np
import pandas as pd

# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Removes the limit for number of displayed columns
pd.set_option("display.max_columns", None)

# Sets limit for number of displayed rows
pd.set_option("display.max_rows", 200)

# Limiting float to 2 decimal
pd.set_option('display.float_format', lambda x: '%.2f' % x)

# To build models for prediction
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Model evaluation metrics
from sklearn import metrics
from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    recall_score,
    precision_score,
    accuracy_score,
    f1_score,
    precision_recall_curve,
    make_scorer
)

# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV

# To ignore warnings
import warnings
warnings.filterwarnings("ignore")
In [12]:
# allow import of dataset from google drive
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Data Overview¶

  • Reading the dataset
  • Understanding the shape of the dataset
  • Checking the data types
  • Checking for missing values
  • Checking for duplicated values
In [13]:
# Loading the dataset
df = pd.read_csv('/content/hmeq.csv')
In [14]:
# copy data to another variable to avoid changes to original data
data = df.copy()
In [15]:
# View first 5 rows
data.head()
Out[15]:
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
0 1 1100 25860.00 39025.00 HomeImp Other 10.50 0.00 0.00 94.37 1.00 9.00 NaN
1 1 1300 70053.00 68400.00 HomeImp Other 7.00 0.00 2.00 121.83 0.00 14.00 NaN
2 1 1500 13500.00 16700.00 HomeImp Other 4.00 0.00 0.00 149.47 1.00 10.00 NaN
3 1 1500 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0 1700 97800.00 112000.00 HomeImp Office 3.00 0.00 0.00 93.33 0.00 14.00 NaN

Observations:

  • The dataframe has 13 columns corresponding to 13 features, with each row corresponding to a customer.

  • There are missing values (NaN) across different rows and columns.

In [16]:
# View last 5 rows
data.tail()
Out[16]:
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
5955 0 88900 57264.00 90185.00 DebtCon Other 16.00 0.00 0.00 221.81 0.00 16.00 36.11
5956 0 89000 54576.00 92937.00 DebtCon Other 16.00 0.00 0.00 208.69 0.00 15.00 35.86
5957 0 89200 54045.00 92924.00 DebtCon Other 15.00 0.00 0.00 212.28 0.00 15.00 35.56
5958 0 89800 50370.00 91861.00 DebtCon Other 14.00 0.00 0.00 213.89 0.00 16.00 34.34
5959 0 89900 48811.00 88934.00 DebtCon Other 15.00 0.00 0.00 219.60 0.00 16.00 34.57
In [17]:
# View random sample of 30 rows
data.sample(30, random_state=42)
Out[17]:
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
1344 0 10600 44696.00 57686.00 DebtCon Office 0.00 0.00 0.00 170.34 0.00 20.00 37.79
625 0 7800 38506.00 50309.00 DebtCon Other 11.00 0.00 0.00 231.00 0.00 32.00 35.91
5908 0 65100 67389.00 142740.00 HomeImp Office 9.00 0.00 1.00 116.91 0.00 11.00 43.37
2991 0 16400 63574.00 88586.00 HomeImp Other NaN 0.00 0.00 298.15 0.00 20.00 29.15
1545 0 11300 NaN 28600.00 HomeImp Office 20.00 0.00 0.00 190.03 0.00 15.00 39.01
1860 0 12400 34055.00 40739.00 DebtCon Other 9.00 NaN 0.00 147.01 6.00 29.00 39.93
4129 1 21600 144276.00 190797.00 DebtCon Mgr 0.00 1.00 4.00 313.43 2.00 18.00 37.97
1643 0 11700 96532.00 107600.00 DebtCon Other 12.00 0.00 0.00 214.87 2.00 15.00 40.29
1374 0 10700 55341.00 78062.00 HomeImp Other 3.00 0.00 1.00 192.60 0.00 12.00 28.83
5919 0 68000 191000.00 288000.00 DebtCon Self 11.00 0.00 0.00 218.10 3.00 25.00 NaN
4319 0 22500 44228.00 86872.00 HomeImp Other 24.00 0.00 0.00 256.29 1.00 6.00 26.85
506 1 7100 39000.00 55000.00 NaN Other 12.00 0.00 1.00 192.77 2.00 31.00 NaN
408 0 6500 55000.00 88300.00 HomeImp Office 29.00 0.00 0.00 234.40 1.00 17.00 NaN
319 0 6000 69876.00 94394.07 HomeImp Other 0.00 0.00 1.00 179.57 0.00 32.00 NaN
3979 0 20800 58440.00 99244.00 DebtCon Office 1.00 0.00 4.00 174.96 0.00 46.00 37.49
1966 1 12800 51800.00 68000.00 HomeImp ProfExe 6.00 NaN NaN NaN NaN NaN NaN
1609 0 11600 28654.00 47042.00 DebtCon Other 10.00 NaN 0.00 158.08 7.00 29.00 38.48
696 0 8100 81322.00 97823.00 DebtCon ProfExe 3.00 0.00 0.00 118.47 0.00 25.00 32.72
5783 1 47000 164411.00 235500.00 DebtCon Office 17.00 0.00 1.00 181.93 3.00 48.00 NaN
3248 0 17300 48132.00 78889.00 DebtCon Mgr 4.00 0.00 0.00 133.17 0.00 8.00 28.30
3475 0 18300 68517.00 97151.00 HomeImp ProfExe 7.00 0.00 0.00 77.52 0.00 15.00 32.00
4498 1 23500 54816.00 87000.00 DebtCon Other 3.50 0.00 1.00 223.20 0.00 17.00 NaN
296 0 5900 70973.00 77989.00 HomeImp Other 7.00 0.00 0.00 120.38 0.00 24.00 41.44
3206 0 17200 89507.00 121047.00 DebtCon Other 14.00 0.00 0.00 118.87 1.00 9.00 34.60
177 0 5000 59900.00 80800.00 DebtCon ProfExe 15.00 0.00 0.00 219.13 1.00 18.00 NaN
3414 0 18000 54196.00 71412.00 DebtCon ProfExe 10.00 0.00 0.00 189.87 1.00 42.00 37.35
5194 0 28000 68000.00 106000.00 HomeImp Other 6.00 NaN NaN 114.70 11.00 21.00 NaN
4301 1 22400 51470.00 68139.00 DebtCon Mgr 9.00 0.00 0.00 31.17 2.00 8.00 37.95
1310 0 10500 64331.00 65040.00 HomeImp Other 4.00 0.00 0.00 66.54 1.00 20.00 42.33
1086 0 9800 49298.00 73426.00 HomeImp Office 9.00 3.00 1.00 242.39 1.00 29.00 34.69

Observations

  • The column BAD is the target variable, showing 0 (loan repaid) and 1 (client defaulted). It is a binary categorical variable.

  • There are a number of missing values (NaN) across different columns.

  • The column REASON contains categorical values like 'DebtCon' and 'HomeImp'.

  • The column JOB contains values like 'Office', 'Other', 'Mgr', 'ProfExe', and 'Self'.

  • Some numerical columns like LOAN, MORTDUE, VALUE, CLAGE, and DEBTINC show a range of values.

  • The columns DEROG and DELINQ, representing derogatory reports and delinquent credit lines, are mostly 0.00; values greater than 0.00 appear particularly in rows where BAD is 1. These could be important indicators of default risk.

In [18]:
# Checking unique values for each feature
data.nunique()
Out[18]:
0
BAD 2
LOAN 540
MORTDUE 5053
VALUE 5381
REASON 2
JOB 6
YOJ 99
DEROG 11
DELINQ 14
CLAGE 5314
NINQ 16
CLNO 62
DEBTINC 4693

In [19]:
# See shape of the data
data.shape
Out[19]:
(5960, 13)

Observations:

  • The dataset has 5960 rows and 13 columns
In [20]:
# Check info of the data
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   BAD      5960 non-null   int64  
 1   LOAN     5960 non-null   int64  
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object 
 5   JOB      5681 non-null   object 
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(2), object(2)
memory usage: 605.4+ KB

Observations:

  • The dataset contains 5960 rows (entries) and 13 columns (features) related to client information on credit history, work, home mortgage loans and value.

  • There are 11 numeric features with the columns BAD and LOAN having a int64 datatype and the columns MORTDUE, VALUE, YOJ, DEROG, DELINQ, CLAGE, NINQ, CLNO, DEBTINC having a float64 datatype.

  • There are 2 categorical features with the columns REASON and JOB having a object datatype. These are categorical variables.

  • The column BAD is the target variable, with the value = 0 indicating loan repaid and the value = 1 indicating loan default.

  • The columns BAD and LOAN have no missing values, but the columns MORTDUE, VALUE, REASON, JOB, YOJ, DEROG, DELINQ, CLAGE, NINQ, CLNO, and DEBTINC have non-null counts lower than the total number of entries. The column DEBTINC has the most missing values, with only 4693 non-null entries. These will have to be addressed for model building.

In [21]:
# Use the isnull().sum() function to check percentage of missing values for each column
(data.isnull().sum() / data.shape[0]) * 100
Out[21]:
0
BAD 0.00
LOAN 0.00
MORTDUE 8.69
VALUE 1.88
REASON 4.23
JOB 4.68
YOJ 8.64
DEROG 11.88
DELINQ 9.73
CLAGE 5.17
NINQ 8.56
CLNO 3.72
DEBTINC 21.26

Observations:

  • Columns BAD and LOAN have 0% missing values, which is good because these are important columns.

  • DEBTINC has the highest percentage of missing values at 21.26%. This column represents the debt-to-income ratio and its significant portion of missing data will need to be addressed.

  • DEROG has a relatively high percentage of missing values at 11.88%.

  • DELINQ, MORTDUE, YOJ, NINQ, CLAGE, JOB, REASON, CLNO, and VALUE also have missing values, ranging from 1.88% to 9.73%.

  • The missing values will have to be treated before models can be built.
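The treatment itself happens later in the notebook; as a minimal sketch of one common approach (median imputation for numeric columns, mode for categorical ones), applied here to a toy frame rather than `data`:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the HMEQ missingness pattern (not real HMEQ rows)
toy = pd.DataFrame({
    "DEBTINC": [33.0, np.nan, 40.0, np.nan],
    "REASON": ["DebtCon", None, "HomeImp", "DebtCon"],
})

# Median imputation for the numeric column, mode for the categorical one
toy["DEBTINC"] = toy["DEBTINC"].fillna(toy["DEBTINC"].median())
toy["REASON"] = toy["REASON"].fillna(toy["REASON"].mode()[0])

print(toy.isnull().sum().sum())  # → 0
```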

In [22]:
# Check for duplicates with .duplicated().sum() functions
data.duplicated().sum()
Out[22]:
np.int64(0)

Observations:

  • Data has no duplicated rows.

Summary Statistics¶

In [23]:
# Checking the descriptive statistics
data.describe().T
Out[23]:
count mean std min 25% 50% 75% max
BAD 5960.00 0.20 0.40 0.00 0.00 0.00 0.00 1.00
LOAN 5960.00 18607.97 11207.48 1100.00 11100.00 16300.00 23300.00 89900.00
MORTDUE 5442.00 73760.82 44457.61 2063.00 46276.00 65019.00 91488.00 399550.00
VALUE 5848.00 101776.05 57385.78 8000.00 66075.50 89235.50 119824.25 855909.00
YOJ 5445.00 8.92 7.57 0.00 3.00 7.00 13.00 41.00
DEROG 5252.00 0.25 0.85 0.00 0.00 0.00 0.00 10.00
DELINQ 5380.00 0.45 1.13 0.00 0.00 0.00 0.00 15.00
CLAGE 5652.00 179.77 85.81 0.00 115.12 173.47 231.56 1168.23
NINQ 5450.00 1.19 1.73 0.00 0.00 1.00 2.00 17.00
CLNO 5738.00 21.30 10.14 0.00 15.00 20.00 26.00 71.00
DEBTINC 4693.00 33.78 8.60 0.52 29.14 34.82 39.00 203.31

Observations:

  • The column BAD has a mean of 0.20, which aligns with the data description that had 20% of loans defaulting. This confirms the class imbalance in the target variable.

  • The column LOAN has a mean of 18607.97 and a wide range of values, with a min of 1100 and a max of 89900.

  • The columns MORTDUE and VALUE also show a wide range of values, indicating variability in mortgage amounts and property values.

  • The column YOJ (years at present job) has a mean of 8.92 years, with a min of 0 and a max of 41 years.

  • The columns DEROG and DELINQ have low means, and the 75th percentile being 0.00 for both suggests most applicants have no major derogatory reports or delinquent credit lines. The maximums are high, indicating outliers with multiple negative credit events, which could be significant indicators of default risk.

  • The column CLAGE (age of oldest credit line) has a mean of 179 months, with a large range of values.

  • The column NINQ (number of recent credit inquiries) has a mean of 1.19, with a maximum of 17, suggesting some individuals have a very high number of recent inquiries.

  • The column CLNO (number of existing credit lines) has a mean of 21.30, with a maximum of 71, suggesting some individuals have a very high number of credit lines.

  • The column DEBTINC (debt-to-income ratio) has a mean of 33.78. The maximum of 203.31 is very high compared to the 75th percentile, suggesting the presence of significant outliers.
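The outliers flagged above can be counted with the usual 1.5 × IQR rule; a sketch on a toy series (the same logic would apply to `data['DEBTINC']`):

```python
import pandas as pd

# Toy values echoing DEBTINC's quartiles and its extreme maximum
s = pd.Series([29.0, 33.0, 35.0, 39.0, 203.31])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr  # upper fence of the IQR rule

outliers = s[s > upper]
print(len(outliers))  # → 1 (the 203.31 value)
```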

In [24]:
# create a list of categorical columns using .select_dtypes().columns function
cat_cols = data.select_dtypes(include='object').columns
cat_cols = cat_cols.append(pd.Index(['BAD']))
In [25]:
# Printing the sub-categories of each categorical column
for column in cat_cols:
  print(data[column].value_counts(normalize=True))
  print('*'*40)
REASON
DebtCon   0.69
HomeImp   0.31
Name: proportion, dtype: float64
****************************************
JOB
Other     0.42
ProfExe   0.22
Office    0.17
Mgr       0.14
Self      0.03
Sales     0.02
Name: proportion, dtype: float64
****************************************
BAD
0   0.80
1   0.20
Name: proportion, dtype: float64
****************************************

Observations:

  • For REASON, 'DebtCon' (debt consolidation) is the most frequent reason for a loan request, making up 69% of the non-missing entries in this column; 'HomeImp' (home improvement) makes up the other 31%. Debt consolidation is thus a far more common reason than home improvement in this dataset.

  • The column JOB has 6 categories. 'Other' is the most common, making up 42% of non-missing entries. 'ProfExe' (Professional/Executive) is the next most frequent at 22%, followed by 'Office' at 17%, 'Mgr' (Manager) at 14%, 'Self' (self-employed) at 3%, and 'Sales' at 2%. 'Other' likely groups all the less frequent job types.

  • The column BAD confirms that there is a class imbalance that must be considered for model building, otherwise the model will be bad at identifying the minority class, which is the primary focus of the model.
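The notebook's imports include SMOTE for handling this imbalance; as an even simpler illustration of the underlying idea, here is random oversampling of the minority class with plain pandas, on toy labels (not the real target):

```python
import pandas as pd

# Toy imbalanced frame: 8 repaid (0) vs 2 defaulted (1), echoing the 80/20 split
toy = pd.DataFrame({"BAD": [0] * 8 + [1] * 2, "LOAN": range(10)})

minority = toy[toy["BAD"] == 1]
majority_n = len(toy[toy["BAD"] == 0])

# Duplicate minority rows (sampling with replacement) until classes are balanced
extra = minority.sample(majority_n - len(minority), replace=True, random_state=42)
balanced = pd.concat([toy, extra], ignore_index=True)

print(balanced["BAD"].value_counts().to_dict())  # → {0: 8, 1: 8}
```

SMOTE goes further by synthesizing new minority points via interpolation rather than duplicating existing rows, which reduces overfitting to repeated samples.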

Exploratory Data Analysis (EDA) and Visualization¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Leading Questions:

  1. What is the range of values for the loan amount variable "LOAN"?
  2. How does the distribution of years at present job "YOJ" vary across the dataset?
  3. How many unique categories are there in the REASON variable?
  4. What is the most common category in the JOB variable?
  5. Is there a relationship between the REASON variable and the proportion of applicants who defaulted on their loan?
  6. Do applicants who default have a significantly different loan amount compared to those who repay their loan?
  7. Is there a correlation between the value of the property and the loan default rate?
  8. Do applicants who default have a significantly different mortgage amount compared to those who repay their loan?

Univariate Analysis¶

Numerical Variables

In [26]:
# Function to plot a boxplot and histogram along same scale to visualize numeric feature distribution
def histogram_boxplot(data, feature, figsize=(12, 7), kde=True, bins=None):
  '''
  data: dataframe
  feature: dataframe column
  figsize: size of figure (default (12,7))
  kde: whether to show the density curve (default True)
  bins: number of bins for histogram (default None)
  '''
  f2, (ax_box2, ax_hist2) = plt.subplots(
      nrows=2,  # Number of rows of the subplot grid= 2
      sharex=True,  # x-axis will be shared among all subplots
      gridspec_kw={"height_ratios": (0.25, 0.75)},
      figsize=figsize,  # Adjust to your preference
  )
  sns.boxplot(
      data=data, x=feature, ax=ax_box2, showmeans=True, color="blue"
  )  # Boxplot will be created and a star will indicate the mean value
  sns.histplot(
      data=data, x=feature, kde=kde, ax=ax_hist2,
      bins=bins if bins is not None else 30, color="blue",
  )  # Histogram; falls back to 30 bins when none are specified
  ax_hist2.axvline(
      data[feature].mean(), color="green", linestyle="--"
  )  # Add mean to the histogram
  ax_hist2.axvline(
      data[feature].median(), color="black", linestyle="-"
  )  # Add median to the histogram
In [27]:
# histogram and boxplot for visualizing distribution of 'LOAN' numerical feature
histogram_boxplot(data, 'LOAN')
No description has been provided for this image

Observations:

  • The 'LOAN' histogram shows a peak around 15000. The distribution of loan amounts is right-skewed, shown by the mean > median and a long tail on the right side. While the majority of loans concentrate at the lower end, roughly 2000 to 30000, there are outliers with very high amounts going up to 89900.
  • The boxplot confirms the right-skewed distribution, with the box (IQR) toward the lower amounts and the mean > median. There are many points beyond the upper whisker of the boxplot; these outliers confirm a sizeable number of loans with very high amounts.

1. What is the range of values for the loan amount variable "LOAN"?

  • Based on the summary statistics, the 'LOAN' column has a range from 1100 to 89900. The plots above visually confirm this.
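The range can also be read off programmatically with `min()`/`max()`; a sketch on a toy series (on the real frame this would be `data['LOAN'].min()` and `data['LOAN'].max()`):

```python
import pandas as pd

# Toy values spanning the observed LOAN extremes from the summary statistics
loan = pd.Series([1100, 16300, 89900])

print(loan.min(), loan.max(), loan.max() - loan.min())  # → 1100 89900 88800
```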
In [28]:
# histogram and boxplot for visualizing distribution of 'MORTDUE' numerical feature
histogram_boxplot(data, 'MORTDUE')
No description has been provided for this image

Observations:

  • The 'MORTDUE' histogram shows a right-skewed distribution similar to the 'LOAN' column, with its mean greater than its median. Most values are concentrated at mortgage amounts below 150000, but the long tail on the right side indicates significant outliers at the upper end.
  • The boxplot confirms this right-skewed distribution, with its median slightly left of center and a longer upper whisker. The many points beyond the upper whisker are outliers, indicating that while most applicants have lower mortgage amounts, a substantial number carry much higher mortgage debts.
In [29]:
# histogram and boxplot for visualizing distribution of 'VALUE' numerical feature
histogram_boxplot(data, 'VALUE')
No description has been provided for this image

Observations:

  • The 'VALUE' histogram shows a right-skewed distribution for property values, with its mean greater than its median. The majority of values are concentrated below 200000, but the presence of a long tail on the right side indicates significant outliers going up to 855909.
  • The boxplot confirms the right-skewed distribution, with many outliers beyond the upper whisker indicating a number of properties with comparably much higher values.

How does the distribution of years at present job "YOJ" vary across the dataset?

In [30]:
# histogram and boxplot for visualizing distribution of 'YOJ' numerical feature
histogram_boxplot(data, 'YOJ')
No description has been provided for this image

Observations:

  • The 'YOJ' histogram shows a right-skewed distribution for years at current job. Most applicants have been at their current job for under 20 years, with far fewer having stayed longer than that.
  • The boxplot confirms this right-skewed distribution, with higher end outliers indicating some have been at their job much longer than the rest.
In [31]:
# histogram and boxplot for visualizing distribution of 'DEROG' numerical feature
histogram_boxplot(data, 'DEROG')
No description has been provided for this image

Observations:

  • The 'DEROG' histogram shows that the large majority of applicants have no major derogatory reports. A few have a small number, and even fewer have a higher count, contributing to the strong right skew.
  • The boxplot confirms that most are concentrated at 0, with a number of outliers that indicate a few individuals with a high number of derogatory reports.
In [32]:
# histogram and boxplot for visualizing distribution of 'DELINQ' numerical feature
histogram_boxplot(data, 'DELINQ')
No description has been provided for this image

Observations:

  • The 'DELINQ' histogram shows that the large majority of applicants have no delinquent credit lines. A few have a small number, and even fewer have a higher count, contributing to the strong right skew. The pattern is similar to the 'DEROG' column.
  • The boxplot confirms that most are concentrated at 0, with a number of outliers that indicate a few individuals with a high number of delinquent credit lines.
In [33]:
# histogram and boxplot for visualizing distribution of 'CLAGE' numerical feature
histogram_boxplot(data, 'CLAGE')
No description has been provided for this image

Observations:

  • The 'CLAGE' histogram shows a right-skewed distribution, although the skew is less pronounced than for some of the previous variables. There is a peak around 150-200 months, where the age of the oldest credit line falls for most applicants.
  • The boxplot shows a slight skew with the mean > median. There are a number of higher end outliers, indicating some applicants with a much older credit history.
In [34]:
# histogram and boxplot for visualizing distribution of 'NINQ' numerical feature
histogram_boxplot(data, 'NINQ')
No description has been provided for this image

Observations:

  • The 'NINQ' histogram shows a right-skewed distribution, with most applicants having 0 to 3 recent credit inquiries. The counts drop sharply along the long right tail, with outliers going up to 17 inquiries.
  • The boxplot confirms this, with outliers indicating a few applicants with a high number of recent credit inquiries.
In [35]:
# histogram and boxplot for visualizing distribution of 'CLNO' numerical feature
histogram_boxplot(data, 'CLNO')
No description has been provided for this image

Observations:

  • The 'CLNO' histogram shows a generally normal distribution for the number of existing credit lines, with a peak around 20-25.

  • The boxplot confirms the generally normal distribution around the mean and median.

  • There are some outliers on the higher end, indicating a few individuals with a very large number of credit lines, going all the way up to 71 credit lines.

In [36]:
# histogram and boxplot for visualizing distribution of 'DEBTINC' numerical feature
histogram_boxplot(data, 'DEBTINC', bins=60)
No description has been provided for this image

Observations:

  • The 'DEBTINC' histogram shows a right-skewed distribution for the debt-to-income ratio. There is a peak around 30-40, with most applicants having a debt-to-income ratio between 20% and 50%.
  • The boxplot shows this right-skewed distribution, with a substantial number of outliers at the higher end, going as high as 203%. This indicates some individuals with a very high debt-to-income ratio, which aligns with what was seen in the summary statistics where the maximum value was much higher than the 75th percentile.

Categorical Variables

In [37]:
# create a function for a labeled barplot to visualize distributions of categorical features
def labeled_barplot(data, feature, perc = False, n = None):
  """
  Barplot with percentage at the top

  data: dataframe
  feature: dataframe column
  perc: whether to display percentages instead of count (default is False)
  n: displays the top n category levels (default is None, i.e., display all levels)
  """

  total = len(data[feature])  # Length of the column
  count = data[feature].nunique()
  if n is None:
    plt.figure(figsize = (count + 1, 5))
  else:
    plt.figure(figsize = (n + 1, 5))

  plt.xticks(rotation = 90, fontsize = 15)
  ax = sns.countplot(
    data = data,
    x = feature,
    palette = "Paired",
    order = data[feature].value_counts().index[:n].sort_values(),
  )

  for p in ax.patches:
    if perc == True:
      label = "{:.1f}%".format(
          100 * p.get_height() / total
      )                       # Percentage of each class of the category
    else:
      label = p.get_height()  # Count of each level of the category

    x = p.get_x() + p.get_width() / 2  # Width of the plot
    y = p.get_height()  #height of the plot
    ax.annotate(
        label,
        (x, y),
        ha = "center",
        va = "center",
        size = 12,
        xytext = (0, 5),
        textcoords = "offset points",
    ) # annotate the percentage

  plt.show()  # Show the plot
In [38]:
# labeled barplot for 'BAD' feature
labeled_barplot(data, 'BAD', perc=True)
No description has been provided for this image

Observations:

  • The 'BAD' barplot shows it has 2 unique subcategories (0 = loan repaid, 1 = client defaulted on loan).
  • This confirms the significant class imbalance, with the majority (80.1%) having repaid their loan. This is an important consideration for model building to avoid bias towards the majority class.

How many unique categories are there in the REASON variable?

In [39]:
# labeled barplot for 'REASON' feature
labeled_barplot(data, 'REASON', perc=True)
No description has been provided for this image

Observations:

  • The 'REASON' barplot shows there are 2 unique subcategories: 'DebtCon' (debt consolidation) and 'HomeImp' (home improvement).

  • The most common reason for the loan request is 'DebtCon', at 65.9% of all rows, with 'HomeImp' at 29.9%. This indicates that debt consolidation is much more common than home improvement in this dataset.

  • The total doesn't sum up to 100% due to missing entries.
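If the missing entries should appear as their own category so that the proportions do sum to 100%, `value_counts` can retain NaN; a sketch on a toy column:

```python
import numpy as np
import pandas as pd

# Toy REASON column with one missing entry
reason = pd.Series(["DebtCon", "DebtCon", "HomeImp", np.nan])

# dropna=False keeps NaN as its own category, so proportions sum to 1.0
props = reason.value_counts(normalize=True, dropna=False)
print(props.sum())  # → 1.0
```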

What is the most common category in the JOB variable?

In [40]:
# labeled barplot for 'JOB' feature
labeled_barplot(data, 'JOB', perc=True)
No description has been provided for this image

Observations:

  • The 'JOB' barplot shows there are 6 unique subcategories: 'Other', 'ProfExe' (Professional/Executive), 'Office', 'Mgr' (Manager), 'Self' (Self-employed), and 'Sales'.

  • 'Other' is the most common job category (40.1%), followed by 'ProfExe' (Professional/Executive), 'Office', 'Mgr' (Manager, 12.9%), 'Self' (3.2%), and 'Sales' (1.8%).

  • 'Other' is likely grouping all other less frequent job types.

  • The total doesn't sum up to 100% due to missing entries.

Bivariate Analysis¶

Continuous vs. Continuous Variables

In [41]:
# Scatterplot of MORTDUE vs. LOAN
plt.figure(figsize=(10, 6))
sns.scatterplot(x='MORTDUE', y='LOAN', data=data)
plt.title('MORTDUE vs. LOAN')
plt.xlabel('Amount Due on Existing Mortgage (MORTDUE)')
plt.ylabel('Amount of Loan Approved (LOAN)')
plt.show()
No description has been provided for this image

Observations:

  • The scatterplot of MORTDUE vs. LOAN shows a positive correlation between the amount due on the existing mortgage and the amount of the approved loan. As the existing mortgage amount increases, the approved loan amount tends to increase as well. The plot shows a clustering of data points at lower values for both variables, with outliers extending towards higher amounts.
In [42]:
# Scatterplot of VALUE vs. LOAN
plt.figure(figsize=(10, 6))
sns.scatterplot(x='VALUE', y='LOAN', data=data)
plt.title('VALUE vs. LOAN')
plt.xlabel('Current Value of the Property (VALUE)')
plt.ylabel('Amount of Loan Approved (LOAN)')
plt.show()
No description has been provided for this image

Observations:

  • The scatterplot of VALUE vs. LOAN shows a positive correlation between the current value of the property and the amount of the loan approved. As the current value of the property increases, the amount of the loan approved also tends to increase. Similar to other plots, many data points are concentrated at the lower end of both variables, with outliers at the higher end.
In [43]:
# Scatterplot of DEBTINC vs. LOAN
plt.figure(figsize=(10, 6))
sns.scatterplot(x='DEBTINC', y='LOAN', data=data)
plt.title('DEBTINC vs. LOAN')
plt.xlabel('Debt-to-Income Ratio (DEBTINC)')
plt.ylabel('Amount of Loan Approved (LOAN)')
plt.show()
No description has been provided for this image

Observations:

  • The scatterplot of DEBTINC vs. LOAN shows that there isn't a strong linear correlation between the debt-to-income ratio and the loan amount. Most data points are clustered within a certain range of DEBTINC, with outliers at very high DEBTINC values. The relationship between these two variables appears complex and not simply linear.
In [44]:
# Scatterplot of CLAGE vs. LOAN
plt.figure(figsize=(10, 6))
sns.scatterplot(x='CLAGE', y='LOAN', data=data)
plt.title('CLAGE vs. LOAN')
plt.xlabel('Age of the Oldest Credit Line in Months (CLAGE)')
plt.ylabel('Amount of Loan Approved (LOAN)')
plt.show()
No description has been provided for this image

Observations:

  • The scatterplot of CLAGE vs. LOAN shows a slight upward slope, suggesting a weak positive correlation: as the age of the oldest credit line increases, the approved loan amount tends to increase slightly. On its own, however, CLAGE is not a strong predictor of LOAN.
In [45]:
# Scatterplot of DEBTINC vs NINQ
plt.figure(figsize=(10, 6))
sns.scatterplot(x='DEBTINC', y='NINQ', data=data)
plt.title('DEBTINC vs NINQ')
plt.xlabel('Debt-to-Income Ratio (DEBTINC)')
plt.ylabel('Number of Recent Credit Inquiries (NINQ)')
plt.show()
No description has been provided for this image

Observations:

  • The scatterplot of DEBTINC vs. NINQ shows a very weak positive correlation. While there's a slight tendency for the debt-to-income ratio to increase with the number of recent credit inquiries, the relationship is not strong, and the data points are widely scattered.
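The visual impressions above can be quantified with Pearson correlation coefficients; a sketch on a toy frame (on the full dataset this would be, e.g., `data.corr(numeric_only=True)`):

```python
import pandas as pd

# Toy frame with a roughly linear MORTDUE-LOAN relationship
toy = pd.DataFrame({
    "MORTDUE": [25000, 50000, 75000, 100000],
    "LOAN":    [5000, 11000, 16000, 23000],
})

corr = toy["MORTDUE"].corr(toy["LOAN"])  # Pearson correlation by default
print(round(corr, 3))
```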

Categorical Variables vs. Continuous Variables

In [46]:
# create a boxplot for each numerical feature in relation to 'REASON' categorical feature
# list of numerical features
numerical_features = ['LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']

# Create a figure with a 5x2 grid of subplots
fig, axes = plt.subplots(5, 2, figsize=(20, 25)) # Increased figsize
axes = axes.flatten() # Flatten the 2D array of axes for easier iteration

# iterate through numerical features to create boxplot
for i, feature in enumerate(numerical_features):
    sns.boxplot(data=data, x=feature, y='REASON', ax=axes[i], hue='REASON', palette='rainbow')
    axes[i].set_title(f'Boxplot of {feature} vs. REASON')
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel('REASON')
    axes[i].get_legend().remove() # Remove the legend from each subplot


plt.tight_layout()
plt.show()
No description has been provided for this image

Observations:

  • LOAN vs. REASON : Applicants who took out loans for 'DebtCon' tend to have higher loan amounts compared to those who took out loans for 'HomeImp'.

  • MORTDUE vs. REASON: Similar to LOAN, the mortgage amount appears higher for applicants with the reason 'DebtCon' compared to 'HomeImp'.

  • VALUE vs. REASON: Property values appear slightly higher for 'DebtCon' compared to 'HomeImp'.

  • YOJ vs. REASON: The median number of years at the current job appears to be slightly lower for applicants with the reason 'DebtCon' compared to 'HomeImp'.

  • DEROG vs. REASON: There doesn't seem to be a significant visual difference in the distribution of derogatory reports between the two reasons. Both show a large concentration at zero, with some outliers.

  • DELINQ vs. REASON: Similar to DEROG, the number of delinquent credit lines does not show a clear difference between 'DebtCon' and 'HomeImp' based on these boxplots.

  • CLAGE vs. REASON: The age of the oldest credit line appears similar for the two categories.

  • NINQ vs. REASON: The number of recent credit inquiries appears higher for 'DebtCon' compared to 'HomeImp'.

  • CLNO vs. REASON: The number of existing credit lines appears to be higher for applicants with the reason 'DebtCon' compared to 'HomeImp'.

  • DEBTINC vs. REASON: The debt-to-income ratio seems to be slightly higher for 'DebtCon' compared to 'HomeImp'.

  • Overall, applicants who seek loans for debt consolidation (DebtCon) tend to have higher loan amounts (LOAN), more existing credit lines (CLNO), and higher debt-to-income ratios (DEBTINC) than those seeking loans for home improvement (HomeImp), along with a slightly lower median number of years at their current job (YOJ). The distributions of major derogatory reports (DEROG), delinquent credit lines (DELINQ), and age of the oldest credit line (CLAGE) are similar between the two groups.

  • These plots highlight that the applicants using loans for debt consolidation exhibit characteristics associated with higher financial leverage and activity compared to those seeking home improvement loans.
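The group-level differences read off the boxplots above can also be summarized numerically with a `groupby` median comparison. A minimal sketch on made-up data (the column names match HMEQ, but the values below are illustrative, not from the dataset):

```python
import pandas as pd

# Illustrative stand-in for the HMEQ data (values are made up)
df = pd.DataFrame({
    'REASON':  ['DebtCon', 'DebtCon', 'DebtCon', 'HomeImp', 'HomeImp', 'HomeImp'],
    'LOAN':    [20000, 25000, 18000, 12000, 15000, 10000],
    'DEBTINC': [38.0, 35.0, 36.0, 30.0, 31.0, 29.0],
})

# Median of each numerical feature per loan reason
medians = df.groupby('REASON')[['LOAN', 'DEBTINC']].median()
print(medians)
```

On the real data, the same one-liner applied to `data` would give a compact table of per-reason medians to pair with the boxplots.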

Do applicants who default have a significantly different loan amount compared to those who repay their loan?

Is there a correlation between the value of the property and the loan default rate?

Do applicants who default have a significantly different mortgage amount compared to those who repay their loan?

In [47]:
# Bivariate analysis of 'LOAN', 'MORTDUE', 'VALUE' numerical features vs 'BAD' with histograms and boxplots

numerical_features = ['LOAN', 'MORTDUE', 'VALUE']

fig, axes = plt.subplots(len(numerical_features), 3, figsize=(20, 20))

for i, feature in enumerate(numerical_features):
    # Boxplot
    sns.boxplot(data=data, x='BAD', y=feature, ax=axes[i, 0], hue='BAD')
    axes[i, 0].set_title(f'Boxplot of {feature} vs. BAD')

    # Histogram for not defaulted
    hist_0 = sns.histplot(data=data[data['BAD']==0], x=feature, ax=axes[i, 1], bins=20, kde=True, color='blue')
    axes[i, 1].set_title(f'Histogram of {feature} (Not Defaulted)')
    # Add percentages to bars for BAD=0
    total_0 = len(data[data['BAD']==0].dropna(subset=[feature]))
    for patch in hist_0.patches:
        height = patch.get_height()
        if height > 0:
            percentage = (height / total_0) * 100
            axes[i, 1].text(patch.get_x() + patch.get_width()/2., height, f'{percentage:.1f}%', ha='center', va='bottom', fontsize=8)


    # Histogram for defaulted
    hist_1 = sns.histplot(data=data[data['BAD']==1], x=feature, ax=axes[i, 2], bins=20, kde=True, color='orange')
    axes[i, 2].set_title(f'Histogram of {feature} (Defaulted)')
    # Add percentages to bars for BAD=1
    total_1 = len(data[data['BAD']==1].dropna(subset=[feature]))
    for patch in hist_1.patches:
        height = patch.get_height()
        if height > 0:
            percentage = (height / total_1) * 100
            axes[i, 2].text(patch.get_x() + patch.get_width()/2., height, f'{percentage:.1f}%', ha='center', va='bottom', fontsize=8)


plt.tight_layout()
plt.show()
[Figure: boxplots and histograms of LOAN, MORTDUE, and VALUE by loan status (BAD)]

Observations:

  • LOAN vs. BAD: Applicants who default do appear to have a different loan amount distribution compared to those who repay their loan. The boxplot shows that the median loan amount for defaulted loans is lower than for non-defaulted loans, and the spread is wider for non-defaulted loans. The histogram percentages show that the distribution of loan amounts for defaulted loans is proportionally more concentrated at lower values than the distribution for non-defaulted loans. This suggests that lower loan amounts are associated with a higher likelihood of default.

  • MORTDUE vs. BAD: Applicants who default do appear to have a different distribution of mortgage amount due compared to those who repay their loan. The boxplot shows that the median amount due on existing mortgages for defaulted applicants is lower than for non-defaulted applicants. The histogram percentages show that the distribution for defaulted loans is proportionally more concentrated at lower values than the distribution for non-defaulted loans. This suggests that, similar to loan amount, lower amounts due on existing mortgages are associated with a higher likelihood of default; applicants with larger mortgages may be in a stronger financial position.

  • VALUE vs. BAD: There does appear to be a correlation between the value of the property and the loan default rate. The boxplot shows that the median property value for defaulted applicants is lower than for non-defaulted applicants. The histogram percentages also indicate that property values for defaulted loans are proportionally more concentrated at lower values than for non-defaulted loans. This suggests that lower property values tend to be associated with a higher default rate: a lower property value may mean less equity for the borrower and less collateral for the lender, potentially increasing the risk of default.

In [48]:
# Bivariate analysis of 'YOJ','CLAGE', 'NINQ', 'CLNO' numerical features vs 'BAD' with histograms and boxplots

numerical_features = ['YOJ','CLAGE', 'NINQ', 'CLNO']

fig, axes = plt.subplots(len(numerical_features), 3, figsize=(20, 20))

for i, feature in enumerate(numerical_features):
    # Boxplot
    sns.boxplot(data=data, x='BAD', y=feature, ax=axes[i, 0], hue='BAD')
    axes[i, 0].set_title(f'Boxplot of {feature} vs. BAD')

    # Histogram for not defaulted
    hist_0 = sns.histplot(data=data[data['BAD']==0], x=feature, ax=axes[i, 1], bins=20, kde=True, color='blue')
    axes[i, 1].set_title(f'Histogram of {feature} (Not Defaulted)')
    # Add percentages to bars for BAD=0
    total_0 = len(data[data['BAD']==0].dropna(subset=[feature]))
    for patch in hist_0.patches:
        height = patch.get_height()
        if height > 0:
            percentage = (height / total_0) * 100
            axes[i, 1].text(patch.get_x() + patch.get_width()/2., height, f'{percentage:.1f}%', ha='center', va='bottom', fontsize=8)


    # Histogram for defaulted
    hist_1 = sns.histplot(data=data[data['BAD']==1], x=feature, ax=axes[i, 2], bins=20, kde=True, color='orange')
    axes[i, 2].set_title(f'Histogram of {feature} (Defaulted)')
    # Add percentages to bars for BAD=1
    total_1 = len(data[data['BAD']==1].dropna(subset=[feature]))
    for patch in hist_1.patches:
        height = patch.get_height()
        if height > 0:
            percentage = (height / total_1) * 100
            axes[i, 2].text(patch.get_x() + patch.get_width()/2., height, f'{percentage:.1f}%', ha='center', va='bottom', fontsize=8)


plt.tight_layout()
plt.show()
[Figure: boxplots and histograms of YOJ, CLAGE, NINQ, and CLNO by loan status (BAD)]

Observations:

  • YOJ vs. BAD: The boxplot shows that the median years at present job for defaulted loans is slightly lower than the median for non-defaulted loans. The histograms are both right-skewed, but the percentages show that a higher proportion of non-defaulted applicants have been at their job for longer periods of time compared to defaulted applicants.

  • CLAGE vs. BAD: The boxplot shows that the median age of the oldest credit line for defaulted loans is lower than the median for non-defaulted loans. The histograms are both right-skewed, but the percentages show that a larger proportion of defaulted applicants have a younger oldest credit line compared to non-defaulted applicants.

  • NINQ vs. BAD: The boxplot shows defaulted loans having a higher upper quartile than non-defaulted loans. The histograms both skew right, but the percentages show a higher proportion of non-defaulted applicants have 0 recent credit inquiries compared to defaulted applicants. The wider spread suggests higher risk with more inquiries: applicants with more recent credit inquiries could be seeking additional credit due to financial instability, a likely indicator of default risk.

  • CLNO vs. BAD: The boxplot shows similar median existing credit lines between defaulted and non-defaulted loans, but the defaulted applicants have a wider box. The histograms are somewhat bell-shaped with percentages showing similar distributions of existing credit lines.

In [49]:
# Bivariate analysis of 'DEROG','DELINQ', 'DEBTINC' numerical features vs BAD with histograms and boxplots

numerical_features = ['DEROG','DELINQ', 'DEBTINC']

fig, axes = plt.subplots(len(numerical_features), 3, figsize=(20, 20))

for i, feature in enumerate(numerical_features):
    # Boxplot
    sns.boxplot(data=data, x='BAD', y=feature, ax=axes[i, 0], hue='BAD')
    axes[i, 0].set_title(f'Boxplot of {feature} vs. BAD')

    # Histogram for not defaulted
    hist_0 = sns.histplot(data=data[data['BAD']==0], x=feature, ax=axes[i, 1], bins=20, kde=True, color='blue')
    axes[i, 1].set_title(f'Histogram of {feature} (Not Defaulted)')
    # Add percentages to bars for BAD=0
    total_0 = len(data[data['BAD']==0].dropna(subset=[feature]))
    for patch in hist_0.patches:
        height = patch.get_height()
        if height > 0:
            percentage = (height / total_0) * 100
            axes[i, 1].text(patch.get_x() + patch.get_width()/2., height, f'{percentage:.1f}%', ha='center', va='bottom', fontsize=8)


    # Histogram for defaulted
    hist_1 = sns.histplot(data=data[data['BAD']==1], x=feature, ax=axes[i, 2], bins=20, kde=True, color='orange')
    axes[i, 2].set_title(f'Histogram of {feature} (Defaulted)')
    # Add percentages to bars for BAD=1
    total_1 = len(data[data['BAD']==1].dropna(subset=[feature]))
    for patch in hist_1.patches:
        height = patch.get_height()
        if height > 0:
            percentage = (height / total_1) * 100
            axes[i, 2].text(patch.get_x() + patch.get_width()/2., height, f'{percentage:.1f}%', ha='center', va='bottom', fontsize=8)


plt.tight_layout()
plt.show()
[Figure: boxplots and histograms of DEROG, DELINQ, and DEBTINC by loan status (BAD)]

Observations:

  • DEROG vs. BAD: The boxplot shows the median for DEROG is 0 for both groups, but the defaulted group has higher values and outliers. The histogram for non-defaulted loans shows that a very high percentage (90.9%) of non-defaulted applicants have 0 derogatory reports. For defaulted loans, it has a lower percentage of 0 derogatory reports (68.4%), and a much larger percentage of defaulted applicants have 1 or more derogatory reports compared to the non-defaulted group. The percentages on the bars for DEROG > 0 in the defaulted histogram are significantly higher than the corresponding percentages in the non-defaulted histogram.

  • DELINQ vs. BAD: The boxplot shows the median for DELINQ is 0 for both groups, but the defaulted group has higher values and outliers. The histogram for non-defaulted loans shows that a very high percentage (84.4%) of non-defaulted applicants have 0 delinquent credit lines. For defaulted loans, it has a lower percentage of 0 delinquent credit lines (52.2%), and a much larger percentage of defaulted applicants have 1 or more delinquent credit lines compared to the non-defaulted group. The percentages on the bars for DELINQ > 0 in the defaulted histogram are significantly higher than the corresponding percentages in the non-defaulted histogram.

  • DEBTINC vs. BAD: The boxplot shows a significantly higher median DEBTINC for defaulted loans compared to non-defaulted loans, with a wider spread and more high outliers in the defaulted group. The histogram for non-defaulted loans shows the largest percentages concentrated at lower DEBTINC values, while defaulted loans show a larger percentage concentrated at higher DEBTINC values. Higher debt-to-income ratios appear to be an indicator of higher default risk, with borrowers with DEBTINC values above 40% significantly more likely to default.
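The 40% threshold mentioned above can be checked directly by comparing default rates on either side of it. A minimal sketch on synthetic data (the values below are illustrative; on the real data, `df` would be the HMEQ dataframe):

```python
import pandas as pd

# Synthetic sample: BAD is the default flag, DEBTINC the debt-to-income ratio
df = pd.DataFrame({
    'BAD':     [0, 0, 1, 0, 1, 0, 1, 1, 0, 0],
    'DEBTINC': [25, 30, 32, 35, 45, 28, 48, 42, 50, 38],
})

# Default rate on each side of the 40% threshold
high = df.loc[df['DEBTINC'] > 40, 'BAD'].mean()
low = df.loc[df['DEBTINC'] <= 40, 'BAD'].mean()
print(f'default rate above 40%: {high:.2f}, at or below 40%: {low:.2f}')
```

A gap between the two rates, as seen in the boxplots, supports using DEBTINC as a key risk indicator.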

Categorical vs. Categorical Variables

In [50]:
# Function to plot stacked bar plots

def stacked_barplot(data, x, y, perc=False, palette='pastel'):
  """
  Print the category counts and plot a stacked bar chart

  data: dataframe
  x: independent variable (categorical)
  y: target variable (categorical)
  perc: whether to display percentages on the bars (default is False)
  palette: color palette to use for the plot (default is 'pastel')
  """
  count = data[x].nunique()
  sorter = data[y].value_counts().index[-1]
  tab1 = pd.crosstab(data[x], data[y], margins = True).sort_values(
      by = sorter, ascending = False
  )
  print(tab1)
  print('-' * 120)
  tab = pd.crosstab(data[x], data[y]).sort_values(
      by = sorter, ascending = False
  )
  ax = tab.plot(kind = "bar", stacked = True, figsize = (count + 1, 5), color=sns.color_palette(palette))
  plt.title(f"Stacked Bar Chart of {x} vs. {y}")
  plt.xlabel(x)
  plt.ylabel("Count")
  plt.legend(loc = "upper left", bbox_to_anchor = (1, 1))

  if perc:
        for p in ax.patches:
            width, height = p.get_width(), p.get_height()
            x_pos, y_pos = p.get_x(), p.get_y()
            total_height = 0
            # Find the total height of the stacked bar
            for patch in ax.patches:
                if patch.get_x() == x_pos and patch.get_width() == width:
                    total_height += patch.get_height()

            if height > 0:
                percentage = (height / total_height) * 100 if total_height > 0 else 0
                ax.text(x_pos + width/2., y_pos + height/2., '{:.1f}%'.format(percentage), ha='center', va='center')


  plt.show()  # Show the plot
In [51]:
# stacked barplot of relationship between 'JOB' and 'REASON' with percentages within bar
stacked_barplot(data, 'JOB', 'REASON', perc=True)
REASON   DebtCon  HomeImp   All
JOB                            
All         3813     1723  5536
Other       1604      716  2320
ProfExe      847      405  1252
Office       620      301   921
Mgr          572      174   746
Self          73      115   188
Sales         97       12   109
------------------------------------------------------------------------------------------------------------------------
[Figure: stacked bar chart of JOB vs. REASON]

Observations:

  • The plot shows that for most job categories, 'DebtCon' is the more frequent reason for a loan request compared to 'HomeImp'.

  • For 'Other', 'ProfExe', 'Office', and 'Mgr', debt consolidation makes up the majority of loan reasons (ranging from around 67% to 77%). For 'Self', home improvement is the more frequent reason for loan application (60%), which is noticeably different from the other categories. For 'Sales', debt consolidation is the overwhelming majority (89%).

  • This reveals that the reason for a loan request is not uniformly distributed across all job types, which suggests that job type is related to the reason for seeking a loan and could be linked to different levels of financial stress or goals. This could potentially influence the risk of default.

Is there a relationship between the REASON variable and the proportion of applicants who defaulted on their loan?

In [52]:
# stacked barplot of relationship between 'REASON' and 'BAD' with percentages in bar
stacked_barplot(data, 'REASON', 'BAD', perc=True)
BAD         0     1   All
REASON                   
All      4567  1141  5708
DebtCon  3183   745  3928
HomeImp  1384   396  1780
------------------------------------------------------------------------------------------------------------------------
[Figure: stacked bar chart of REASON vs. BAD]

Observations:

  • Overall, most applicants did not default, and 'DebtCon' is the more common loan reason compared to 'HomeImp'.
  • The plot suggests that home improvement loans have a slightly higher default rate (22.3%) than debt consolidation loans (19.0%), though the difference is modest. This is somewhat counterintuitive, since debt consolidation loans might be expected to be riskier.
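Whether the 22.3% vs. 19% gap is statistically meaningful can be assessed with a chi-square test of independence on the crosstab counts printed above (a supplementary check, assuming scipy is available):

```python
from scipy.stats import chi2_contingency

# Contingency table from the REASON vs. BAD crosstab above:
# rows = DebtCon, HomeImp; columns = BAD = 0, BAD = 1
table = [[3183, 745],
         [1384, 396]]

chi2, p, dof, expected = chi2_contingency(table)
print(f'chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}')
```

With these counts the p-value falls below 0.05, so the association between REASON and BAD is unlikely to be pure chance, though the effect size remains small.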
In [53]:
# stacked barplot of relationship between 'JOB' and 'BAD' with percentages within bar
stacked_barplot(data, 'JOB', 'BAD', perc=True)
BAD         0     1   All
JOB                      
All      4515  1166  5681
Other    1834   554  2388
ProfExe  1064   212  1276
Mgr       588   179   767
Office    823   125   948
Self      135    58   193
Sales      71    38   109
------------------------------------------------------------------------------------------------------------------------
[Figure: stacked bar chart of JOB vs. BAD]

Observations:

  • The plot shows again the class imbalance between non-defaulted and defaulted loans. The percentage of defaulted loans varies among the different job categories. 'Sales' has the highest percentage of defaulted loans (34.9%), followed by 'Self' (30.1%), 'Mgr' (23.3%), 'Other' (23.2%), and 'Office' (13.2%). This suggests that job type could be a relevant factor in predicting loan default.
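The per-job default rates quoted above can be recomputed from the printed crosstab counts; on the raw data, `pd.crosstab(data['JOB'], data['BAD'], normalize='index')` would give the same rates directly:

```python
import pandas as pd

# Counts taken from the JOB vs. BAD crosstab printed above
counts = pd.DataFrame(
    {0: [1834, 1064, 588, 823, 135, 71],
     1: [554, 212, 179, 125, 58, 38]},
    index=['Other', 'ProfExe', 'Mgr', 'Office', 'Self', 'Sales'],
)

# Share of defaulted loans (BAD = 1) within each job category, highest first
default_rate = (counts[1] / counts.sum(axis=1)).sort_values(ascending=False)
print(default_rate.round(3))
```

This reproduces the ordering in the observation: 'Sales' has the highest default rate and 'Office' the lowest.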

Multivariate Analysis¶

In [54]:
# list of numerical features
numerical_col = ['BAD', 'LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']

# Compute the correlation matrix for numerical columns
correlation_matrix = data[numerical_col].corr()

# Display the correlation matrix as a heatmap with annotations
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", vmin=-1, vmax=1)
plt.title('Correlation Matrix of Numerical Variables')
plt.show()
[Figure: correlation heatmap of numerical variables]

Observations:

  • Correlation with BAD (target variable):

    • BAD has the strongest positive correlation with DELINQ (0.35), suggesting that the number of delinquent credit lines is a significant indicator for loan default.
    • It is followed by DEROG (0.28), which points to major derogatory reports being a key factor in default risk.
    • This is followed by DEBTINC (0.20), which points to higher debt-to-income ratios being related to a higher default risk.
    • Finally NINQ (0.17), which could indicate that number of recent credit inquiries could point towards higher risk of default.
    • There is a negative correlation with CLAGE (-0.17), which could indicate that older credit lines are associated with lower default risk.
    • BAD has weak correlations with LOAN, MORTDUE, VALUE, and CLNO, suggesting that a simple linear relationship between these features and loan default is not strong.
  • There are strong positive correlations between MORTDUE and VALUE (0.88), which is expected as the amount due on a mortgage is highly related to the property value. This might need to be treated for multicollinearity.

  • DEROG and DELINQ have a moderate positive correlation (0.21), which is expected, as these two kinds of credit problems tend to go hand in hand.

  • There is a moderate positive correlation between MORTDUE and CLNO (0.32), and between VALUE and CLNO (0.27), which suggests that applicants with higher existing mortgages and property values might tend to have more credit lines. CLNO also has a positive correlation with CLAGE (0.24), which makes sense, as applicants with older credit histories have had more time to accumulate credit lines.

  • There is a moderate correlation of CLAGE with YOJ (0.20).
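One way to quantify the MORTDUE/VALUE multicollinearity flagged above is the variance inflation factor, which for a single pair of variables reduces to VIF = 1/(1 - r^2). A quick calculation using the 0.88 correlation from the heatmap:

```python
# Pairwise VIF from a correlation coefficient: VIF = 1 / (1 - r^2)
r = 0.88  # MORTDUE vs. VALUE correlation from the heatmap above
vif = 1.0 / (1.0 - r ** 2)
print(f'VIF = {vif:.2f}')
```

The result (about 4.4) sits just below the common rule-of-thumb cutoff of 5, so the pair may be tolerable for tree-based models, though dropping or combining one of the two is a typical remedy for linear models.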

In [55]:
# Visualizing relation of property and loan features with BAD
# Select numerical features for the pair plot
numerical_features = ['LOAN', 'MORTDUE', 'VALUE', 'BAD'] # Include BAD for hue

# Create a pair plot
sns.pairplot(data[numerical_features], hue='BAD', palette='bright')  # pairplot creates its own figure, so no plt.figure() call is needed
plt.suptitle('Pair Plot of Numerical Features by Loan Status (BAD)', y=1.02) # Add a title
plt.show();
[Figure: pair plot of LOAN, MORTDUE, and VALUE by loan status (BAD)]

Observations:

  • MORTDUE vs. VALUE, MORTDUE vs. LOAN, and VALUE vs. LOAN plots show positive correlations, indicating that higher property values are associated with higher mortgages and loans. The blue non-defaulted points are more spread out towards higher LOAN and MORTDUE values compared to the orange default points concentrated at lower values.

  • The defaulters appear at higher LOAN values when compared to their VALUE, indicating possible higher leverage.

  • For smaller LOAN amounts, there is a relatively higher proportion of defaulters compared to larger LOAN amounts.

In [56]:
# Visualizing relation of credit-related features with BAD
# Select numerical features for the pair plot
numerical_features = [ 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO','DEBTINC', 'BAD'] # Include BAD for hue

# Create a pair plot
sns.pairplot(data[numerical_features], hue='BAD', palette='bright')  # pairplot creates its own figure, so no plt.figure() call is needed
plt.suptitle('Pair Plot of Numerical Features by Loan Status (BAD)', y=1.02) # Add a title
plt.show();
[Figure: pair plot of DEROG, DELINQ, CLAGE, NINQ, CLNO, and DEBTINC by loan status (BAD)]

Observations:

  • This pairplot strongly emphasizes the role of DEROG, DELINQ, and DEBTINC as indicators of loan default risk. It also shows that defaulted applicants tend to have slightly lower values of CLAGE and CLNO and higher values of NINQ.

  • For DEROG and DELINQ, the histograms show the orange defaulted distributions spread noticeably further toward higher values, with the blue non-defaulted points heavily concentrated at 0. This confirms that defaulted applicants are much more likely to have derogatory reports and delinquent credit lines. The scatterplots likewise show orange defaulted points much more prevalent at higher values than blue non-defaulted points.

  • For CLAGE and CLNO, the distributions for orange defaulted points appear shifted towards lower values compared to the blue non-defaulted points.

  • For NINQ, the orange defaulted distribution shows relatively higher frequencies at values greater than 0 compared to the blue non-defaulted distribution.

  • For DEBTINC, the orange defaulted distribution is shifted towards higher values compared to the blue non-defaulted distribution. The scatterplots show orange defaulted points tending to appear at higher DEBTINC values.

Treating Outliers¶

  • BAD won't be treated for outliers since it is the target variable with binary categories (0 and 1).

  • LOAN won't be treated because its values reflect legitimate variability in loan sizes.

  • MORTDUE, VALUE, YOJ, CLNO, and DEBTINC will have outliers clipped to the IQR whisker limits, since extreme values could skew the modeling results.

  • DEROG, DELINQ and NINQ have more extreme skew from outlier values and will have the values capped at the 95th percentile.

In [57]:
# create copy of dataframe for outlier treatment
data_clean = data.copy()
In [58]:
# selecting numerical columns
# selecting numerical columns
num_cols = data.select_dtypes('number').columns

# checking summary statistics
sum_stat = data[num_cols].describe().T
sum_stat
Out[58]:
count mean std min 25% 50% 75% max
BAD 5960.00 0.20 0.40 0.00 0.00 0.00 0.00 1.00
LOAN 5960.00 18607.97 11207.48 1100.00 11100.00 16300.00 23300.00 89900.00
MORTDUE 5442.00 73760.82 44457.61 2063.00 46276.00 65019.00 91488.00 399550.00
VALUE 5848.00 101776.05 57385.78 8000.00 66075.50 89235.50 119824.25 855909.00
YOJ 5445.00 8.92 7.57 0.00 3.00 7.00 13.00 41.00
DEROG 5252.00 0.25 0.85 0.00 0.00 0.00 0.00 10.00
DELINQ 5380.00 0.45 1.13 0.00 0.00 0.00 0.00 15.00
CLAGE 5652.00 179.77 85.81 0.00 115.12 173.47 231.56 1168.23
NINQ 5450.00 1.19 1.73 0.00 0.00 1.00 2.00 17.00
CLNO 5738.00 21.30 10.14 0.00 15.00 20.00 26.00 71.00
DEBTINC 4693.00 33.78 8.60 0.52 29.14 34.82 39.00 203.31
In [59]:
data_clean.describe().T
Out[59]:
count mean std min 25% 50% 75% max
BAD 5960.00 0.20 0.40 0.00 0.00 0.00 0.00 1.00
LOAN 5960.00 18607.97 11207.48 1100.00 11100.00 16300.00 23300.00 89900.00
MORTDUE 5442.00 73760.82 44457.61 2063.00 46276.00 65019.00 91488.00 399550.00
VALUE 5848.00 101776.05 57385.78 8000.00 66075.50 89235.50 119824.25 855909.00
YOJ 5445.00 8.92 7.57 0.00 3.00 7.00 13.00 41.00
DEROG 5252.00 0.25 0.85 0.00 0.00 0.00 0.00 10.00
DELINQ 5380.00 0.45 1.13 0.00 0.00 0.00 0.00 15.00
CLAGE 5652.00 179.77 85.81 0.00 115.12 173.47 231.56 1168.23
NINQ 5450.00 1.19 1.73 0.00 0.00 1.00 2.00 17.00
CLNO 5738.00 21.30 10.14 0.00 15.00 20.00 26.00 71.00
DEBTINC 4693.00 33.78 8.60 0.52 29.14 34.82 39.00 203.31
In [60]:
# create function using IQR to treat outliers
def treat_outliers(df, column):
  Q1 = df[column].quantile(0.25)
  Q3 = df[column].quantile(0.75)
  IQR = Q3 - Q1
  lower_bound = Q1 - 1.5 * IQR
  upper_bound = Q3 + 1.5 * IQR
  df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
  df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])
  return df

# create function to cap extreme outliers
def treat_outliers_extreme(data_clean, column, percentile=0.95):
  cap_value = data_clean[column].quantile(percentile)
  data_clean[column] = np.where(data_clean[column] > cap_value, cap_value, data_clean[column])
  return data_clean
In [61]:
# treat outliers in 'MORTDUE', 'VALUE', 'YOJ', 'CLNO', 'DEBTINC'
for column in ['MORTDUE', 'VALUE', 'YOJ', 'CLNO', 'DEBTINC']:
  data_clean = treat_outliers(data_clean, column)
In [62]:
# treat outliers in 'DEROG', 'DELINQ' and 'NINQ'
for column in ['DEROG', 'DELINQ', 'NINQ']:
  data_clean = treat_outliers_extreme(data_clean, column)
In [63]:
# selecting numerical columns
num_cols = data_clean.select_dtypes('number').columns

# checking summary statistics
sum_stat_treated = data_clean[num_cols].describe().T
sum_stat_treated
Out[63]:
count mean std min 25% 50% 75% max
BAD 5960.00 0.20 0.40 0.00 0.00 0.00 0.00 1.00
LOAN 5960.00 18607.97 11207.48 1100.00 11100.00 16300.00 23300.00 89900.00
MORTDUE 5442.00 71566.09 37203.65 2063.00 46276.00 65019.00 91488.00 159306.00
VALUE 5848.00 98538.06 45070.80 8000.00 66075.50 89235.50 119824.25 200447.38
YOJ 5445.00 8.87 7.43 0.00 3.00 7.00 13.00 28.00
DEROG 5252.00 0.19 0.52 0.00 0.00 0.00 0.00 2.00
DELINQ 5380.00 0.38 0.81 0.00 0.00 0.00 0.00 3.00
CLAGE 5652.00 179.77 85.81 0.00 115.12 173.47 231.56 1168.23
NINQ 5450.00 1.05 1.25 0.00 0.00 1.00 2.00 4.00
CLNO 5738.00 21.03 9.42 0.00 15.00 20.00 26.00 42.50
DEBTINC 4693.00 33.68 7.14 14.35 29.14 34.82 39.00 53.80
  • The outlier treatment was successful in capping the extreme values in the specified numerical columns, bringing them within a more expected range based on the distribution of the majority of the data. This can help improve the performance of certain models that are sensitive to outliers.
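As a side note, the two `np.where` calls in `treat_outliers` are equivalent to pandas' built-in `Series.clip`, which is the more idiomatic form. A small sketch verifying the equivalence on synthetic data:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 5.0, 9.0, 50.0, -20.0])
Q1, Q3 = s.quantile(0.25), s.quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Two-step np.where clipping (as in treat_outliers) ...
via_where = np.where(s < lower, lower, s)
via_where = pd.Series(np.where(via_where > upper, upper, via_where))

# ... matches the idiomatic one-liner
via_clip = s.clip(lower, upper)
assert via_clip.equals(via_where)
```

Using `clip` would let `treat_outliers` shrink to a single line per column without changing its behavior.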

Treating Missing Values¶

  1. Improved data quality: A cleaner dataset with fewer missing values is more reliable for analysis and model training.

  2. Enhanced model performance: Properly handling missing values helps models perform better by training on complete data, leading to more accurate predictions.

  3. Preservation of data integrity: Imputing or removing missing values ensures consistency and accuracy in the dataset, maintaining its integrity for further analysis.

  4. Reduced bias: Addressing missing values prevents bias in analysis, ensuring a more accurate representation of the underlying patterns in the data.

In [64]:
# Impute missing values in categorical columns with the mode
for column in ['REASON', 'JOB']:
    if column in data_clean.columns:
        mode_value = data_clean[column].mode()[0]
        data_clean[column] = data_clean[column].fillna(mode_value)

# Impute missing values in numerical columns with the median
numerical_cols_with_missing = ['MORTDUE', 'VALUE', 'YOJ', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']
for column in numerical_cols_with_missing:
     if column in data_clean.columns:
        median_value = data_clean[column].median()
        data_clean[column] = data_clean[column].fillna(median_value)

# Verify that there are no more missing values
display(data_clean.isnull().sum())
0
BAD 0
LOAN 0
MORTDUE 0
VALUE 0
REASON 0
JOB 0
YOJ 0
DEROG 0
DELINQ 0
CLAGE 0
NINQ 0
CLNO 0
DEBTINC 0

Observations:

  • Missing values for categorical columns REASON and JOB were treated using imputation, where missing values were replaced with the mode (most frequent value).
  • Missing values for numerical columns MORTDUE, VALUE, YOJ, DEROG, DELINQ, CLAGE, NINQ, CLNO, DEBTINC were imputed with the median. Median imputation is robust for numerical features because, unlike the mean, it is not pulled toward extreme values.
In [65]:
# selecting numerical columns
num_cols = data_clean.select_dtypes('number').columns

# checking summary statistics
sum_stat_treated2 = data_clean[num_cols].describe().T
sum_stat_treated2
Out[65]:
count mean std min 25% 50% 75% max
BAD 5960.00 0.20 0.40 0.00 0.00 0.00 0.00 1.00
LOAN 5960.00 18607.97 11207.48 1100.00 11100.00 16300.00 23300.00 89900.00
MORTDUE 5960.00 70997.07 35597.71 2063.00 48139.00 65019.00 88200.25 159306.00
VALUE 5960.00 98363.24 44663.11 8000.00 66489.50 89235.50 119004.75 200447.38
YOJ 5960.00 8.71 7.12 0.00 3.00 7.00 12.00 28.00
DEROG 5960.00 0.17 0.49 0.00 0.00 0.00 0.00 2.00
DELINQ 5960.00 0.34 0.78 0.00 0.00 0.00 0.00 3.00
CLAGE 5960.00 179.44 83.57 0.00 117.37 173.47 227.14 1168.23
NINQ 5960.00 1.04 1.20 0.00 0.00 1.00 2.00 4.00
CLNO 5960.00 20.99 9.25 0.00 15.00 20.00 26.00 42.50
DEBTINC 5960.00 33.92 6.35 14.35 30.76 34.82 37.95 53.80
In [66]:
# Difference of stats after imputing
sum_stat_diff = sum_stat_treated2 - sum_stat_treated

# show stats difference
sum_stat_diff
Out[66]:
count mean std min 25% 50% 75% max
BAD 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
LOAN 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
MORTDUE 518.00 -569.03 -1605.94 0.00 1863.00 0.00 -3287.75 0.00
VALUE 112.00 -174.81 -407.69 0.00 414.00 0.00 -819.50 0.00
YOJ 515.00 -0.16 -0.31 0.00 0.00 0.00 -1.00 0.00
DEROG 708.00 -0.02 -0.03 0.00 0.00 0.00 0.00 0.00
DELINQ 580.00 -0.04 -0.03 0.00 0.00 0.00 0.00 0.00
CLAGE 308.00 -0.33 -2.24 0.00 2.25 0.00 -4.42 0.00
NINQ 510.00 -0.00 -0.05 0.00 0.00 0.00 0.00 0.00
CLNO 222.00 -0.04 -0.18 0.00 0.00 0.00 0.00 0.00
DEBTINC 1267.00 0.24 -0.79 0.00 1.62 0.00 -1.05 0.00

Observations:

  • There were relatively small shifts in the mean, std, and quartiles for most features, suggesting median imputation had only a modest impact on the overall distributions.
In [67]:
# list of numerical features
numerical_col = ['BAD', 'LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']

# Compute the correlation matrix for numerical columns
correlation_matrix = data_clean[numerical_col].corr()

# Display the correlation matrix as a heatmap with annotations
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", vmin=-1, vmax=1)
plt.title('Correlation Matrix of Numerical Variables')
plt.show()
[Figure: correlation heatmap of numerical variables after outlier and missing value treatment]
In [68]:
# list of numerical features
numerical_col = ['BAD', 'LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']

# Compute the correlation matrix difference after outlier and missing value treatment
correlation_matrix_diff = data_clean[numerical_col].corr() - data[numerical_col].corr()

# Display the correlation matrix diff as a heatmap with annotations
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix_diff, annot=True, cmap='coolwarm', fmt=".2f", vmin=-1, vmax=1)
plt.title('Change in Correlation Matrix After Outlier and Missing Value Treatment')
plt.show()
[Figure: heatmap of correlation differences after outlier and missing value treatment]

Observations:

  • The small magnitude of the changes suggests that imputing missing values with the median (for numerical columns) did not drastically alter the overall linear relationships between the numerical features.
  • While there are minor shifts in the correlation coefficients, the fundamental structure of the linear relationships, as observed in the original correlation heatmap, likely remains largely the same.

Important Insights from EDA¶

What are the most important observations and insights from the data based on the EDA performed?

Summary Statistics

  • The dataset contains 5960 rows and 13 columns, with no duplicate entries.
  • The target variable BAD shows a significant class imbalance (approximately 80% non-default, 20% default).
  • Several columns have missing values, most notably DEBTINC (21.26%) and DEROG (11.88%).
  • Numerical features like LOAN, MORTDUE, VALUE, CLAGE, NINQ, CLNO, and DEBTINC show a wide range of values and presence of outliers.
  • DEROG and DELINQ have low means, with 75% of values being 0, but high maximums indicating significant outliers.

Univariate Analysis

  • Numerical Features: Histograms and boxplots revealed right-skewed distributions for most numerical features, including LOAN, MORTDUE, VALUE, YOJ, DEROG, DELINQ, CLAGE, NINQ, CLNO, and DEBTINC. Outliers are present in many of these features.
    • LOAN, MORTDUE, and VALUE are concentrated at lower values with a tail towards higher amounts.
    • DEROG and DELINQ are heavily concentrated at 0, with few applicants having a high number of derogatory reports or delinquent lines.
    • DEBTINC shows a peak between 30-40%, with a significant number of outliers at very high ratios.
  • Categorical Features: Barplots showed the distribution of BAD, REASON, and JOB.
    • BAD confirms the 80/20 class imbalance.
    • REASON is dominated by 'DebtCon' (69% of non-missing), followed by 'HomeImp' (31%).
    • JOB has 'Other' as the most frequent category (42% of non-missing), followed by 'ProfExe', 'Office', 'Mgr', 'Self', and 'Sales'.

Bivariate Analysis

  • Continuous vs. Continuous: Scatterplots revealed relationships between numerical features.
    • Strong positive correlations exist between MORTDUE and VALUE, and moderate positive correlations between LOAN and both MORTDUE and VALUE.
    • Weak or no strong linear correlations were observed between DEBTINC and LOAN, and DEBTINC and NINQ.
  • Categorical vs. Continuous: Boxplots and histograms of numerical features vs. categorical features (colored by BAD) provided insights into how numerical distributions vary across categories and loan status.
    • Applicants who defaulted tend to have lower median LOAN, MORTDUE, and VALUE compared to those who repaid their loans. The distributions for defaulted loans are more concentrated at the lower end for these features.
    • Defaulted applicants tend to have a significantly higher number of DEROG and DELINQ compared to those who repaid, with a much larger proportion having values greater than 0.
    • Defaulted applicants have a significantly higher median and a wider distribution for DEBTINC compared to non-defaulted applicants. Higher DEBTINC is strongly associated with increased default risk.
    • Applicants who defaulted tend to have a higher number of recent credit inquiries (NINQ) compared to those who repaid, with a larger proportion having values greater than 0.
    • Applicants who defaulted tend to have a slightly lower median years at their current job (YOJ) and a younger oldest credit line (CLAGE) compared to those who repaid.
    • The number of existing credit lines (CLNO) shows a less pronounced difference in median between defaulted and non-defaulted applicants.
    • The distribution of numerical features varies across JOB categories (e.g., 'Self' and 'ProfExe' have higher LOAN/MORTDUE/VALUE medians and wider spreads; 'Sales' and 'Self' have slightly higher NINQ and DEBTINC).
  • Categorical vs. Categorical: Stacked barplots showed relationships between categorical features and BAD.
    • Loans taken for 'HomeImp' appear to have a slightly higher default rate (22.2%) compared to 'DebtCon' loans (19.0%).
    • Default rates vary significantly by JOB type, with 'Sales' (34.9%) and 'Self' (30.1%) showing notably higher default rates compared to 'ProfExe' (16.6%) and 'Office' (13.2%).
    • JOB type is related to REASON for loan (e.g., 'Self' has a higher proportion of 'HomeImp' loans; 'Sales' has a very high proportion of 'DebtCon' loans).

Multivariate Analysis

  • Correlation Heatmap: The correlation heatmap revealed the pairwise linear relationships between all numerical features. Strong positive correlations were observed between MORTDUE and VALUE (0.88). Importantly, it showed positive correlations between BAD and DEROG (0.28), DELINQ (0.35), and DEBTINC (0.20), indicating these features have the strongest linear association with default risk. Weak or no strong linear correlations were observed between BAD and other numerical features like LOAN, MORTDUE, and VALUE.
  • Pairplots (Numerical Features by BAD): Pairplots visually confirmed the relationships observed in bivariate analysis and the correlation heatmap, showing how the distributions and pairwise relationships of numerical features differ between defaulted and non-defaulted loans.
    • Pairplots involving DEROG, DELINQ, and DEBTINC clearly show the defaulted points (orange) being more prevalent at higher values of these features across their relationships with other numerical variables.
    • Pairplots involving LOAN, MORTDUE, and VALUE show the non-defaulted points (blue) extending to higher values, while defaulted points are more concentrated at lower values.
    • The pairplot of credit-related features highlights the strong association of higher DEROG, DELINQ, and DEBTINC with default, and also suggests that defaulted applicants may tend to have slightly lower CLAGE and CLNO and higher NINQ.

Treating Outliers

  • Based on initial observations, outliers were present in most numerical features.
  • BAD and LOAN were not treated for outliers.
  • MORTDUE, VALUE, YOJ, CLNO, and DEBTINC were treated using the IQR method (capping at 1.5*IQR from quartiles).
  • DEROG, DELINQ, and NINQ, which had more extreme skew and important outlier values related to default risk, were treated by capping at the 95th percentile to retain some information from these values.
  • Descriptive statistics after treatment confirmed that the maximum values for the treated columns were successfully capped, with relatively small shifts in other statistics, indicating that the treatment was effective in reducing the influence of extreme values.

Treating Missing Values

  • Missing values were present in several columns.
  • Missing values in categorical columns (REASON, JOB) were imputed with the mode.
  • Missing values in numerical columns (MORTDUE, VALUE, YOJ, DEROG, DELINQ, CLAGE, NINQ, CLNO, DEBTINC) were imputed with the median.
  • Verification after imputation showed that all columns had 0 missing values.
  • The difference in descriptive statistics and correlation matrix before and after imputation showed relatively small changes, suggesting median imputation had a moderate impact on the overall distributions and linear relationships and did not significantly distort the data.

Key Takeaways from EDA

  • Key Predictors: Features like DEROG (derogatory reports), DELINQ (delinquent credit lines), and DEBTINC (debt-to-income ratio) are strongly associated with higher loan default risk. Higher values in these indicate a greater likelihood of default.
  • Property and Loan Value Insights: Counterintuitively, lower values in LOAN, MORTDUE, and VALUE are associated with a higher risk of default.
  • Credit History and Job Tenure: Applicants with more recent credit inquiries (NINQ), fewer years at their current job (YOJ), and a younger oldest credit line (CLAGE) show a slightly higher tendency to default.
  • Job Type and Loan Reason: Job type (JOB) and reason for loan (REASON) are related to default risk, with 'Sales' and 'Self' job categories having higher default rates, and 'HomeImp' loans showing a slightly higher default rate than 'DebtCon'.
  • Class Imbalance: The significant class imbalance in the target variable (BAD) is a critical factor to consider during model building and evaluation.
  • Data Quality Handling: The presence of missing values and outliers was successfully addressed through imputation and capping, respectively, preparing the data for modeling.

Model Building - Approach¶

  • Data preparation: I handled the raw data by imputing missing values (medians for numerical columns, modes for categorical columns) and capping extreme outliers. I will transform categorical features into a numerical format suitable for modeling, and scale numerical features where needed. As part of this preparation, I will also address class imbalance using techniques such as class weights and SMOTE.

  • Split data: I will divide the prepared data into training (for building the model) and testing (for evaluating performance on unseen data) sets, ensuring the split is stratified to maintain class proportions.

  • Build model: I will select a suitable classification algorithm (e.g., Logistic Regression, Decision Tree, Random Forest) for the loan default prediction task.

  • Fit model: I will train the selected model using the features and target variable from the training dataset.

  • Tune model: I will optimize the model's hyperparameters to improve its performance, typically using cross-validation on the training data.

  • Test model: I will evaluate the final trained and tuned model's performance on the independent test dataset using appropriate metrics (like Recall and F1-score) to assess how well it generalizes.
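The tuning step above can be sketched with a cross-validated grid search. The estimator, grid values, and scoring choice below are illustrative placeholders under the assumptions of this project (class-weighted classifier, recall-focused evaluation), not the final configuration:

```python
# Illustrative sketch of hyperparameter tuning with stratified cross-validation.
# The grid values here are assumed candidates, not the tuned values used later.
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    'max_depth': [3, 5, 7],          # assumed candidate tree depths
    'min_samples_leaf': [1, 5, 10],  # assumed candidate leaf sizes
}

grid = GridSearchCV(
    DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, random_state=42),
    param_grid,
    scoring='recall',  # prioritize catching defaulters (minimize false negatives)
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
# grid.fit(X_train, y_train)  # then evaluate grid.best_estimator_ on the test set
```

Scoring on recall aligns the search with the evaluation criterion described below, while cross-validation guards against tuning to noise in a single train split.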

Model Evaluation Criterion:¶

In the context of loan default prediction, a model's misclassifications can manifest as two types of errors:

False Positives: Occur when the model predicts that an applicant will default, but they would have actually repaid the loan. False Negatives: Occur when the model predicts that an applicant will not default, but they actually do default. For this specific business problem, False Negatives represent a greater risk, as they can lead to significant financial losses for the bank by approving loans to individuals who will ultimately default. While False Positives (rejecting a loan to a potentially good applicant) also carry a cost in terms of lost business opportunities, this is generally considered less severe than the financial impact of a default.

Therefore, a primary objective is to minimize False Negatives. This makes the Recall score a crucial performance evaluation metric. Recall measures the proportion of actual defaulters that the model correctly identifies (True Positives out of all actual positives). A higher Recall score indicates that the model is effective at capturing a larger percentage of true defaulters, thereby reducing False Negatives.

However, it is also important to consider the number of False Positives generated by the model. To balance the need for high recall with the desire to limit false alarms, the F1-Score will also be utilized as a key performance metric. The F1-Score is the harmonic mean of Precision and Recall, providing a balanced measure of the model's performance.

Our approach will involve evaluating various machine learning algorithms based on their performance, with a particular focus on achieving high Recall and a good F1-Score to effectively identify potential defaulters while managing the rate of false positives.
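To make these metrics concrete, here is a small hand-checkable sketch; the counts are invented for illustration and are not drawn from this dataset:

```python
# Toy illustration of Recall, Precision, and F1 (counts are invented).
from sklearn.metrics import recall_score, precision_score, f1_score

# 10 actual defaulters: the model catches 8 (2 false negatives),
# and also raises 4 false alarms among 10 actual non-defaulters.
y_true = [1]*10 + [0]*10
y_pred = [1]*8 + [0]*2 + [1]*4 + [0]*6

recall = recall_score(y_true, y_pred)        # 8 / (8 + 2)  = 0.80
precision = precision_score(y_true, y_pred)  # 8 / (8 + 4)  ≈ 0.67
f1 = f1_score(y_true, y_pred)                # harmonic mean ≈ 0.73
print(recall, precision, f1)
```

The harmonic mean pulls the F1-score toward the weaker of the two components, which is why it penalizes a model that buys recall with a flood of false positives.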

In [69]:
# Function to print classification report and get confusion matrix
def metric_score(actual, predicted):
  report = classification_report(actual, predicted, output_dict=True)
  report_df = pd.DataFrame(report).T
  print(classification_report(actual, predicted))
  cm = confusion_matrix(actual, predicted)
  plt.figure(figsize=(8, 6))
  sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False,
              xticklabels=['Non-Default', 'Default'],
              yticklabels=['Non-Default', 'Default'])
  plt.xlabel('Predicted')
  plt.ylabel('Actual')
  plt.title('Confusion Matrix')
  plt.show()
  return report_df

Logistic Regression¶

Data Preparation for Logistic Regression Model¶

The data was cleaned earlier: missing values were imputed and outliers were capped. The next step is to encode the categorical variables with one-hot encoding (JOB) and label encoding (REASON). The numerical features will also need to be scaled so they contribute on a comparable scale to model performance.

The features will also be checked for multicollinearity with the variance inflation factor.

Then the data will be split into a training set and test set, where the training set will be used to build the model, and the test set will be used to evaluate its performance.

In [70]:
# Creating dummy variables for categorical columns
# One-hot encode categorical variables
data_clean = pd.get_dummies(data_clean, columns=['JOB'], drop_first=True)

# Mapping REASON
reason_mapping = {'DebtCon': 0, 'HomeImp': 1}
data_clean['REASON'] = data_clean['REASON'].map(reason_mapping)

# Convert boolean columns to numeric (0s and 1s)
for col in data_clean.columns:
    if data_clean[col].dtype == 'bool':
        data_clean[col] = data_clean[col].astype(int)

data_clean.sample(10)
Out[70]:
BAD LOAN MORTDUE VALUE REASON YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC JOB_Office JOB_Other JOB_ProfExe JOB_Sales JOB_Self
2993 0 16400 57651.00 76854.00 0 1.00 0.00 0.00 172.20 0.00 25.00 31.82 1 0 0 0 0
5766 0 44900 95410.00 157649.00 0 16.00 0.00 1.00 216.10 4.00 31.00 40.48 0 1 0 0 0
4207 0 21900 53458.00 77455.00 0 10.00 0.00 0.00 192.89 1.00 42.00 40.96 0 0 1 0 0
5080 0 27100 65019.00 31681.00 1 23.00 0.00 0.00 92.51 2.00 15.00 29.74 0 1 0 0 0
110 0 4300 72021.00 80027.00 1 2.00 2.00 0.00 263.97 0.00 5.00 36.43 0 1 0 0 0
1375 1 10700 55043.00 68609.00 0 2.00 0.00 0.00 46.27 1.00 17.00 25.45 0 1 0 0 0
3766 1 20000 65019.00 115750.00 1 7.00 0.00 2.00 132.53 0.00 10.00 34.82 0 1 0 0 0
3720 0 19600 128876.00 162994.00 0 2.00 0.00 1.00 382.31 1.00 42.50 36.07 1 0 0 0 0
2698 0 15300 127172.00 156578.00 0 7.00 0.00 0.00 213.29 0.00 19.00 36.29 0 0 0 0 0
4925 1 26000 43314.00 71303.00 0 8.00 0.00 3.00 246.32 3.00 23.00 43.22 0 1 0 0 0

Observations:

  • JOB has been successfully converted to a numerical format using one-hot encoding.
  • REASON has been successfully converted to a numerical format using label encoding.
In [71]:
# count datatype values
data_clean.dtypes.value_counts()
Out[71]:
count
float64 9
int64 8

In [72]:
# use VIF to check for multicollinearity

# Separating independent variables and target
X = data_clean.drop('BAD', axis=1)
y = data_clean['BAD']

# add constant for intercept
X = sm.add_constant(X)

# create function to calculate vif
def calculate_vif(X):
    vif = pd.DataFrame()
    vif['variables'] = X.columns
    vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif

# calculate VIF
calculate_vif(X)
Out[72]:
variables VIF
0 const 49.63
1 LOAN 1.20
2 MORTDUE 3.38
3 VALUE 3.66
4 REASON 1.10
5 YOJ 1.08
6 DEROG 1.09
7 DELINQ 1.09
8 CLAGE 1.14
9 NINQ 1.12
10 CLNO 1.30
11 DEBTINC 1.10
12 JOB_Office 1.92
13 JOB_Other 2.60
14 JOB_ProfExe 2.18
15 JOB_Sales 1.14
16 JOB_Self 1.28
  • All VIF values are under 5, so multicollinearity among the predictors is not a concern, and there is no need to drop any features.
In [73]:
# Separating independent variables and target variable
X = data_clean.drop('BAD', axis=1)
y = data_clean['BAD']
In [74]:
# Split into train and test datasets at a ratio of 70:30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y) # use stratify for class imbalance of target
In [75]:
# Checking shape of train and test datasets
print(X_train.shape)
print(X_test.shape)
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
(4172, 16)
(1788, 16)
BAD
0   0.80
1   0.20
Name: proportion, dtype: float64
BAD
0   0.80
1   0.20
Name: proportion, dtype: float64
In [76]:
# scale the data with StandardScaler()
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
In [77]:
# Apply SMOTE to oversample minority class in the training data
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)

Building a Logistic Regression Model¶

The Logistic Regression model will be initialized and fitted to the training data. Its performance will then be assessed using the classification report and the confusion matrix.

In [78]:
# Build and train Logistic Regression model
# set class_weight to address class imbalance in target and max_iter to set max iterations to converge
LR = LogisticRegression(class_weight={0:0.2, 1:0.8} ,random_state=42, max_iter=1000)

# fit train data to model
LR.fit(X_train_resampled, y_train_resampled)
Out[78]:
LogisticRegression(class_weight={0: 0.2, 1: 0.8}, max_iter=1000,
                   random_state=42)

Model Performance Evaluation and Improvement¶

In [79]:
# Checking performance on the train data
y_train_pred = LR.predict(X_train_resampled)

metric_score(y_train_resampled, y_train_pred)
              precision    recall  f1-score   support

           0       0.89      0.30      0.45      3340
           1       0.58      0.96      0.72      3340

    accuracy                           0.63      6680
   macro avg       0.74      0.63      0.59      6680
weighted avg       0.74      0.63      0.59      6680

Out[79]:
precision recall f1-score support
0 0.89 0.30 0.45 3340.00
1 0.58 0.96 0.72 3340.00
accuracy 0.63 0.63 0.63 0.63
macro avg 0.74 0.63 0.59 6680.00
weighted avg 0.74 0.63 0.59 6680.00

True Negatives (Top-Left): 1008 loans were correctly predicted as non-defaulted.

False Positives (Top-Right): 2332 loans were incorrectly predicted as defaulted (Type I error).

False Negatives (Bottom-Left): 120 loans were incorrectly predicted as non-defaulted when they actually defaulted (Type II error).

True Positives (Bottom-Right): 3220 loans were correctly predicted as defaulted.

Observations:

  • The model, trained on the SMOTE-resampled data, achieved a very high recall for the defaulted class (0.96). This means it is very effective at identifying most of the actual defaulters in the resampled training data. This aligns with the goal of minimizing False Negatives.

  • However, it still has a significant number of false positives (2332) and a lower precision for the defaulted class (0.58), meaning that when it predicts a default, there's a substantial chance it's a false alarm.

  • The performance on the non-defaulted class is weaker, with low recall (0.30), indicating it's missing many non-defaulted loans.

In [80]:
# Checking performance on test data
y_test_pred = LR.predict(X_test_scaled)

lr_test = metric_score(y_test, y_test_pred)
              precision    recall  f1-score   support

           0       0.93      0.31      0.47      1431
           1       0.25      0.91      0.39       357

    accuracy                           0.43      1788
   macro avg       0.59      0.61      0.43      1788
weighted avg       0.79      0.43      0.45      1788


True Negatives (Top-Left): 448 loans were correctly predicted as non-defaulted.

False Positives (Top-Right): 983 loans were incorrectly predicted as defaulted (Type I error). This is a high number of false alarms on the test set.

False Negatives (Bottom-Left): 33 loans were incorrectly predicted as non-defaulted when they actually defaulted (Type II error). This is a very low number of missed defaulters on the test set.

True Positives (Bottom-Right): 324 loans were correctly predicted as defaulted.

Observations:

  • The model's performance on the test set reflects the trade-off observed on the training set due to the class_weight and SMOTE. It prioritizes identifying defaulters at the expense of precision.

  • For the critical task of identifying defaulters (class 1), the model achieves a very high recall of 0.91 on the unseen test data. This is excellent and means the model is very effective at catching most of the actual defaulters, minimizing costly false negatives.

  • However, the model has a very low precision of 0.25 for class 1 on the test set. This means that when the model predicts a default, it is incorrect 75% of the time (983 false positives vs. 324 true positives). The high number of False Positives could lead to rejecting many good loan applicants.

  • The overall accuracy is low (0.43), primarily because the model incorrectly predicts a large number of non-defaulted loans as defaulted.

  • We will see whether adjusting the classification threshold improves the model, using a precision-recall curve to find the threshold that maximizes the F1 score and gives a better balance between precision and recall.

In [81]:
# Get predicted probabilities for the positive class (default)
y_pred_proba = LR.predict_proba(X_train_resampled)

# Compute precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_train_resampled, y_pred_proba[:,1])

# Compute F1 scores for each threshold
f1_scores = [f1_score(y_train_resampled, (y_pred_proba[:,1] >= t).astype(int)) for t in thresholds]

# find the index of the maximum F1 score
max_f1_idx = np.argmax(f1_scores)
optimal_threshold_max_f1 = thresholds[max_f1_idx]
max_f1 = f1_scores[max_f1_idx]
recall_at_max_f1 = recall[max_f1_idx]
precision_at_max_f1 = precision[max_f1_idx]

print(f"Optimal Threshold for Maximum F1: {optimal_threshold_max_f1}")
print(f"Maximum F1: {max_f1}")
print(f"Recall at Maximum F1: {recall_at_max_f1}")
print(f"Precision at Maximum F1: {precision_at_max_f1}")

# Plot precision-recall curve with the optimal threshold
plt.figure(figsize=(10, 7))
plt.plot(thresholds, precision[:-1], 'b--', label='Precision')
plt.plot(thresholds, recall[:-1], 'g--', label='Recall')
plt.axvline(x=optimal_threshold_max_f1, color='r', linestyle='--', label=f'Optimal Threshold: {optimal_threshold_max_f1:.2f}')
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Precision-Recall Curve')
plt.legend()
plt.ylim([0, 1])
plt.show()
Optimal Threshold for Maximum F1: 0.6653342282376163
Maximum F1: 0.7667739858338699
Recall at Maximum F1: 0.891317365269461
Precision at Maximum F1: 0.6727683615819209
In [82]:
# rename model
LR_optimal = LR

#set optimal threshold
optimal_threshold = 0.67

# check performance on the train data
y_train_pred2 = LR_optimal.predict_proba(X_train_resampled)

metric_score(y_train_resampled, y_train_pred2[:,1] > optimal_threshold)
              precision    recall  f1-score   support

           0       0.83      0.57      0.68      3340
           1       0.67      0.88      0.77      3340

    accuracy                           0.73      6680
   macro avg       0.75      0.73      0.72      6680
weighted avg       0.75      0.73      0.72      6680

Out[82]:
precision recall f1-score support
0 0.83 0.57 0.68 3340.00
1 0.67 0.88 0.77 3340.00
accuracy 0.73 0.73 0.73 0.73
macro avg 0.75 0.73 0.72 6680.00
weighted avg 0.75 0.73 0.72 6680.00

Observations:

  • The precision recall curve helped identify a threshold that gives a good balance between precision and recall.
  • An F1-score of 0.77 at a threshold of 0.67, with a Recall of 0.89 and Precision of 0.67, suggests a reasonably good performance in balancing the identification of defaulters and minimizing false alarms on this oversampled data.
In [83]:
# Checking performance on test data
y_test_pred2 = (LR_optimal.predict_proba(X_test_scaled)[:, 1] > optimal_threshold).astype(int)

lr_opt_test = metric_score(y_test, y_test_pred2)
              precision    recall  f1-score   support

           0       0.92      0.60      0.73      1431
           1       0.33      0.79      0.46       357

    accuracy                           0.64      1788
   macro avg       0.62      0.69      0.60      1788
weighted avg       0.80      0.64      0.67      1788


Observations:

  • Compared to the base threshold at 0.5, the adjusted threshold at 0.67 had its recall score decrease from 0.91 to 0.79 for the defaulted class, but its precision score improved from 0.25 to 0.33.

  • The F1 score at the adjusted threshold improved to 0.46 from 0.39 of the base threshold, which shows a better balance between recall and precision.

  • However, the trade-off between false positives and false negatives still needs to be weighed against the business costs of each error type.

  • The Logistic Regression model serves as the baseline, and we can now move on to building other predictive models, such as Decision Tree and Random Forest, and compare their performances.

  • But before that, we can calculate the Logistic Regression model's coefficients and, in turn, the odds ratios. This will help uncover the features that increase or decrease default risk.

In [84]:
# calculate coefficients and intercept
coefficients = LR_optimal.coef_
intercept = LR_optimal.intercept_

# create dataframe for coefficients
coef_df = pd.DataFrame(coefficients.T, index=X_train.columns, columns=['Coefficient'])

# add intercept
coef_df.loc['Intercept'] = intercept[0]

# sort in descending order by Coefficient
coef_df = coef_df.sort_values(by='Coefficient', ascending=False)

coef_df
Out[84]:
Coefficient
Intercept 0.96
DELINQ 0.89
DEBTINC 0.55
DEROG 0.45
NINQ 0.25
JOB_Sales 0.22
JOB_Self 0.14
JOB_ProfExe 0.09
REASON 0.07
JOB_Other 0.04
MORTDUE -0.02
VALUE -0.07
YOJ -0.09
JOB_Office -0.23
CLNO -0.30
LOAN -0.32
CLAGE -0.46

Observations:

Positive coefficients indicate that an increase in a feature's value is associated with an increase in the log-odds of loan default; negative coefficients indicate the opposite. Note that because the features were standardized before fitting, each coefficient reflects the effect of a one-standard-deviation increase rather than a one-raw-unit increase.

Here are some key observations from the sorted coefficients:

  • DELINQ has the largest positive coefficient (0.89), suggesting that the number of delinquent credit lines has the strongest positive impact on the log-odds of default.

  • DEBTINC (0.55) and DEROG (0.45) also have significant positive coefficients, indicating that higher debt-to-income ratio and more derogatory reports are associated with a higher log-odds of default.

  • NINQ (0.25) and JOB_Sales (0.22) have smaller positive coefficients.

  • CLAGE (-0.46) has the largest negative coefficient, suggesting that an older age of the oldest credit line has the strongest negative impact on the log-odds of default (i.e., reduces the likelihood of default).

  • LOAN (-0.32) and CLNO (-0.30) also have negative coefficients, indicating that higher loan amounts and more existing credit lines are associated with a lower log-odds of default according to this model.

  • JOB_Office (-0.23) also has a negative coefficient.

The coefficients can be interpreted further by exponentiating them to odds ratios.

In [85]:
# calculate the odds ratios
odds_ratios = np.exp(coef_df)

# sort in descending order
odds_ratios = odds_ratios.sort_values(by='Coefficient', ascending=False)

odds_ratios
Out[85]:
Coefficient
Intercept 2.62
DELINQ 2.44
DEBTINC 1.74
DEROG 1.56
NINQ 1.28
JOB_Sales 1.24
JOB_Self 1.15
JOB_ProfExe 1.09
REASON 1.07
JOB_Other 1.04
MORTDUE 0.98
VALUE 0.94
YOJ 0.91
JOB_Office 0.79
CLNO 0.74
LOAN 0.72
CLAGE 0.63

Observations:

Here are some key observations from the sorted odds ratios:

  • For DELINQ, each unit increase in the number of delinquent credit lines makes the odds of defaulting 144% higher, holding other features constant. This is a significant increase in risk.

  • For DEBTINC, each unit increase in the debt-to-income ratio makes the odds of defaulting 74% higher, also a substantial increase in risk.

  • For DEROG, each unit increase in the number of major derogatory reports makes the odds of defaulting 56% higher, another significant risk factor.

  • For NINQ, for each unit increase in the number of recent credit inquiries, the odds of defaulting are 28% higher.

  • For JOB_Sales, being in a 'Sales' job category compared to not, increases odds of defaulting by 24%.

  • For CLAGE, each unit increase in the age of the oldest credit line decreases the odds of defaulting by 37% (odds ratio 0.63). An older credit line is associated with lower odds of default.

  • For LOAN, for each unit increase in the loan amount, the odds of defaulting are decreased by 28%. This suggests that larger loan amounts are associated with a lower chance of defaulting.

  • For CLNO, each unit increase in the number of existing credit lines decreases the odds of defaulting by 26%, holding other features constant. This suggests that having more credit lines is associated with lower odds of default.

Delinquent credit lines, debt-to-income ratio, and derogatory reports are the most significant risk factors identified by this model, while older credit lines, higher loan amounts, and more existing credit lines are associated with lower risk.
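As a quick sanity check on these percentage interpretations, an odds ratio converts to a percent change in the odds via (OR − 1) × 100. A minimal sketch using the rounded odds ratios from the table above:

```python
# Sanity check: an odds ratio OR implies a (OR - 1) * 100 percent change
# in the odds of default. Values below are rounded odds ratios from above.
odds_ratios = {'DELINQ': 2.44, 'DEBTINC': 1.74, 'DEROG': 1.56, 'CLAGE': 0.63}
for name, ratio in odds_ratios.items():
    pct = (ratio - 1) * 100
    print(f"{name}: {pct:+.0f}% change in odds per unit increase")
# DELINQ: +144%, DEBTINC: +74%, DEROG: +56%, CLAGE: -37%
```

Ratios above 1 raise the odds of default and ratios below 1 lower them, which is why CLAGE's 0.63 corresponds to a decrease rather than an increase.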

Decision Tree¶

Data Preparation for Decision Tree Model¶

Decision Trees and tree-based ensemble models like Random Forests aren't generally as sensitive as Logistic Regression models are to the scale of features, so there isn't a need to standardize the data with scaling. Unlike Logistic Regression, the decision-making is based on feature importance rather than absolute values.

Similarly, outliers do not need to be given special treatment for Decision Trees, because these models split based on feature values. This makes them robust to extreme values. However, missing values still need to be treated, so we will employ the same imputation techniques used earlier with the Logistic Regression model. We will then also have to encode the categorical variables to a numerical format as done previously.

To complete data pre-processing before model building and evaluation, the target variable needs to be separated from the predictor variables, and then the dataset needs to be partitioned into training (for building the model) and testing (for evaluating the model's performance) sets.

In [86]:
# make copy of data df
data_tree = data.copy()

# check info
data_tree.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   BAD      5960 non-null   int64  
 1   LOAN     5960 non-null   int64  
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object 
 5   JOB      5681 non-null   object 
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(2), object(2)
memory usage: 605.4+ KB
In [87]:
# Impute missing values in categorical columns with the mode
for column in ['REASON', 'JOB']:
    if column in data_tree.columns:
        mode_value = data_tree[column].mode()[0]
        data_tree[column] = data_tree[column].fillna(mode_value)

# Impute missing values in numerical columns with the median
numerical_cols_with_missing = ['MORTDUE', 'VALUE', 'YOJ', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']
for column in numerical_cols_with_missing:
     if column in data_tree.columns:
        median_value = data_tree[column].median()
        data_tree[column] = data_tree[column].fillna(median_value)
In [88]:
# Encode categorical variables in data_tree

# One-hot encode 'JOB' column
data_tree = pd.get_dummies(data_tree, columns=['JOB'], drop_first=True)

# Map 'REASON' column to numerical values
reason_mapping = {'DebtCon': 0, 'HomeImp': 1}
data_tree['REASON'] = data_tree['REASON'].map(reason_mapping)

# Convert boolean columns to numeric (0s and 1s) if any were created by get_dummies
for col in data_tree.columns:
    if data_tree[col].dtype == 'bool':
        data_tree[col] = data_tree[col].astype(int)

data_tree.sample(10)
Out[88]:
BAD LOAN MORTDUE VALUE REASON YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC JOB_Office JOB_Other JOB_ProfExe JOB_Sales JOB_Self
3095 0 16800 67049.00 89729.00 1 7.00 0.00 0.00 286.38 0.00 20.00 28.74 0 1 0 0 0
3169 0 17000 118822.00 208429.00 0 3.00 0.00 0.00 206.74 0.00 16.00 25.87 0 1 0 0 0
5141 0 27600 65019.00 117581.00 0 10.00 0.00 0.00 165.01 0.00 27.00 35.00 0 0 0 0 0
5331 1 30000 65019.00 46200.00 0 0.00 1.00 3.00 200.93 10.00 15.00 34.82 0 1 0 0 0
4096 0 21400 44573.00 80915.00 0 7.00 0.00 0.00 173.47 1.00 20.00 37.83 0 1 0 0 0
2042 0 13000 61612.00 99132.00 1 8.00 0.00 0.00 262.73 1.00 12.00 34.82 0 0 1 0 0
3865 0 20200 51449.00 69060.00 0 2.00 0.00 0.00 216.22 0.00 12.00 33.33 1 0 0 0 0
5690 0 41100 129281.00 194500.00 0 7.00 1.00 0.00 197.43 0.00 25.00 32.51 0 0 0 0 0
3858 0 20200 137000.00 174685.00 0 6.00 0.00 0.00 183.67 5.00 43.00 34.82 0 0 1 0 0
341 0 6100 65019.00 46830.00 1 0.00 0.00 1.00 173.47 0.00 0.00 13.31 0 1 0 0 0

Observations:

  • The categorical variables JOB and REASON were successfully encoded to numerical dtypes.
In [89]:
# assign data_tree to new variable dt
dt = data_tree
In [90]:
# Separating independent variables and target variable
X = dt.drop('BAD', axis=1)
y = dt['BAD']
In [91]:
# Split into train and test datasets at a ratio of 70:30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y) # use stratify for class imbalance of target
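Because stratify=y is passed, train_test_split preserves the ~20% default rate in both splits. A small self-contained check on toy labels (not the HMEQ data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 80% zeros, 20% ones, mirroring the ~20% default rate
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Stratification preserves the class ratio in both splits
print(y_tr.mean(), y_te.mean())  # 0.2 0.2
```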

Building a Decision Tree Model¶

In [92]:
# Decision Tree Classifier
dt = DecisionTreeClassifier(class_weight={0:0.2, 1:0.8}, random_state=42) # use class weight to address class imbalance

# fit train data to model
dt.fit(X_train, y_train)
Out[92]:
DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, random_state=42)
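The class_weight={0: 0.2, 1: 0.8} argument multiplies each sample's weight by its class weight, so it is equivalent to passing explicit per-sample weights. A toy sketch on synthetic data (illustrative only) showing the two formulations build the same tree:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (rng.rand(200) < 0.2).astype(int)  # ~20% positives

# class_weight multiplies each sample's weight by its class weight
clf_cw = DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8},
                                random_state=42).fit(X, y)

# Equivalent formulation with explicit per-sample weights
w = np.where(y == 1, 0.8, 0.2)
clf_sw = DecisionTreeClassifier(random_state=42).fit(X, y, sample_weight=w)

# Both trees make identical predictions
print((clf_cw.predict(X) == clf_sw.predict(X)).all())  # True
```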

Model Performance Evaluation and Improvement¶

In [93]:
# Checking performance on the train data
y_train_pred = dt.predict(X_train)

metric_score(y_train, y_train_pred)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3340
           1       1.00      1.00      1.00       832

    accuracy                           1.00      4172
   macro avg       1.00      1.00      1.00      4172
weighted avg       1.00      1.00      1.00      4172

(confusion matrix plot)

True Negatives: 3340 (all actual non-defaulters were correctly predicted)

False Positives: 0 (no non-defaulters were incorrectly predicted as default)

False Negatives: 0 (no actual defaulters were incorrectly predicted as non-default)

True Positives: 832 (all actual defaulters were correctly predicted)

Observations:

  • The model performed perfectly on the train data. This is a sign of overfitting: the model has memorized the training data, including its noise, rather than learning only its general patterns.
In [94]:
# Checking performance on the test data
y_test_pred = dt.predict(X_test)

dt_test = metric_score(y_test, y_test_pred)
              precision    recall  f1-score   support

           0       0.91      0.93      0.92      1431
           1       0.69      0.62      0.65       357

    accuracy                           0.87      1788
   macro avg       0.80      0.78      0.79      1788
weighted avg       0.86      0.87      0.87      1788

(confusion matrix plot)

True Negatives (Top-Left): 1331 loans were correctly predicted as non-defaulted.

False Positives (Top-Right): 100 loans were incorrectly predicted as defaulted (Type I error).

False Negatives (Bottom-Left): 135 loans were incorrectly predicted as non-defaulted when they actually defaulted (Type II error).

True Positives (Bottom-Right): 222 loans were correctly predicted as defaulted.

Observations:

  • This confirms that the Decision Tree model overfitted the training data. It learned the training data too well, including the noise, and did not generalize perfectly to unseen data.

  • The decision tree model achieved a recall score of 0.62, which means it correctly predicted 62% of actual defaulters in the test data.

  • The Decision Tree model performs worse on recall compared to the Logistic Regression models, but its higher F1 score (0.65 vs. 0.46 for the adjusted LR model) shows it strikes a better balance between false positives and false negatives.

  • It is important to address the overfitting on the training data. Hyperparameter tuning can reduce the overfitting and hopefully improve the model's weaker class 1 results relative to the Logistic Regression models.

Decision Tree - Hyperparameter Tuning¶

  • Hyperparameter tuning is tricky in the sense that there is no direct way to calculate how a change in the hyperparameter value will reduce the loss of your model, so we usually resort to experimentation. We'll use Grid search to perform hyperparameter tuning.
  • Grid search is a tuning technique that attempts to compute the optimum values of hyperparameters.
  • It is an exhaustive search that is performed on the specific parameter values of a model.
  • The parameters of the estimator/model used to apply these methods are optimized by cross-validated grid-search over a parameter grid.
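Because the search is exhaustive, its cost can be computed up front with ParameterGrid (same parameter grid as the tuning cell below; np.arange(2, 10) yields 8 depth values):

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

dt_parameters = {
    'max_depth': np.arange(2, 10),                          # 8 values
    'criterion': ['gini', 'entropy'],                       # 2 values
    'min_samples_leaf': [5, 10, 15, 20],                    # 4 values
    'min_samples_split': [10, 20, 30, 40],                  # 4 values
    'class_weight': [{0: 0.2, 1: 0.8}, {0: 0.3, 1: 0.7}],  # 2 values
}

n_combos = len(ParameterGrid(dt_parameters))
print(n_combos)      # 512 parameter combinations (8*2*4*4*2)
print(n_combos * 5)  # 2560 model fits with cv=5, plus one final refit
```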

Criterion {“gini”, “entropy”}

The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.

max_depth

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_leaf

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

You can learn about more hyperparameters at the link below and try tuning them.

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In [95]:
# Create Decision Tree Classifier to be used for tuning
dt_tuned = DecisionTreeClassifier(random_state=42)

# Create parameter grid dictionary for different hyperparameter values to try
dt_parameters = {
    'max_depth': np.arange(2, 10),# controls max number of levels of decision tree
    'criterion': ['gini', 'entropy'], # measuring the quality of a split
    'min_samples_leaf': [5, 10, 15, 20], # minimum number of samples to make a leaf node
    'min_samples_split': [10, 20, 30, 40], # minimum number of samples before node can split
    'class_weight': [{0:0.2, 1:0.8}, {0:0.3, 1:0.7}] # adjust class weights
}

# Score(recall_score) used to compare parameter combinations
scorer = metrics.make_scorer(recall_score, pos_label = 1)

# Run the grid search
grid_search = GridSearchCV(dt_tuned, dt_parameters, scoring = scorer, cv = 5)

grid_search = grid_search.fit(X_train, y_train)

# Set the classifier to the best combination of parameters
dt_tuned = grid_search.best_estimator_

# Fit the best algorithm to the data
dt_tuned.fit(X_train, y_train)

# Display the best hyperparameters found by GridSearchCV
print("Best hyperparameters found by GridSearchCV:")
print(grid_search.best_params_)
Best hyperparameters found by GridSearchCV:
{'class_weight': {0: 0.2, 1: 0.8}, 'criterion': 'gini', 'max_depth': np.int64(7), 'min_samples_leaf': 10, 'min_samples_split': 40}
In [96]:
# Checking tuned performance on the train data
y_train_pred = dt_tuned.predict(X_train)

metric_score(y_train, y_train_pred)
              precision    recall  f1-score   support

           0       0.96      0.88      0.92      3340
           1       0.63      0.85      0.73       832

    accuracy                           0.87      4172
   macro avg       0.80      0.86      0.82      4172
weighted avg       0.89      0.87      0.88      4172

(confusion matrix plot)

True Negatives (Top-Left): 2928 loans were correctly predicted as non-defaulted.

False Positives (Top-Right): 412 loans were incorrectly predicted as defaulted (Type I error).

False Negatives (Bottom-Left): 124 loans were incorrectly predicted as non-defaulted when they actually defaulted (Type II error).

True Positives (Bottom-Right): 708 loans were correctly predicted as defaulted.

Observations:

  • Compared to the initial Decision Tree model's performance on the training data, the tuned model scores lower, which is a good indication that hyperparameter tuning reduced overfitting on the training data. It should generalize better to unseen data.
In [97]:
# Checking tuned model performance on the test data
y_test_pred = dt_tuned.predict(X_test)

dt_tuned_test = metric_score(y_test, y_test_pred)
              precision    recall  f1-score   support

           0       0.93      0.87      0.90      1431
           1       0.58      0.76      0.66       357

    accuracy                           0.84      1788
   macro avg       0.76      0.81      0.78      1788
weighted avg       0.86      0.84      0.85      1788

(confusion matrix plot)

True Negatives (Top-Left): 1238 loans were correctly predicted as non-defaulted.

False Positives (Top-Right): 193 loans were incorrectly predicted as defaulted (Type I error).

False Negatives (Bottom-Left): 87 loans were incorrectly predicted as non-defaulted when they actually defaulted (Type II error).

True Positives (Bottom-Right): 270 loans were correctly predicted as defaulted.

Observations:

  • The tuned Decision Tree model's recall for the defaulted class (0.76) is higher than the initial model's recall (0.62), which means the tuning helped the model identify more actual defaulters on unseen data (false negatives decreased from 135 to 87). However, its precision for the defaulted class (0.58) is lower than the initial model's precision (0.69), so it outputs more false positives (which increased from 100 to 193). Its F1 score is slightly higher (0.66 vs. 0.65).

  • The tuned Decision Tree model has comparable recall to the tuned Logistic Regression model (0.76 vs. 0.79), but it outperforms it on precision (0.58 vs. 0.33) and F1 score (0.66 vs. 0.46). This better balance makes the tuned Decision Tree the more favorable model based on our evaluation metrics.

  • The tuned Decision Tree model provides a good balance between recall and precision for the defaulted class on the test set (Recall 0.76, Precision 0.58, F1 0.66), but we will see whether a Random Forest can perform better on this dataset.

  • But before that, let's examine the tree plot to reveal insights into the model's decision-making, along with a ranked feature importance plot to see which features the model determined were the key indicators of default risk.

In [98]:
# Visualize tuned Decision Tree with max depth limited to 4
features = list(X.columns) #store independent variable features in list
plt.figure(figsize = (20, 20))
tree.plot_tree(dt_tuned, feature_names = features, class_names = ['Not Defaulted', 'Defaulted'], max_depth = 4, filled = True, fontsize = 8, node_ids=True)
plt.show()
(decision tree plot)

Observations:

  • DEBTINC (Debt-to-Income Ratio) is the most important feature as it is used for the very first split at the root node. This confirms the importance of this feature as seen in the correlation analysis. If an applicant's DEBTINC is less than or equal to 34.789, they go to the left branch (Node 1). If it's greater than 34.789, they go to the right branch (Node 34).
  • For applicants with a lower DEBTINC, the next important factor is the number of delinquent credit lines (DELINQ). If DELINQ is 1.5 or less, they go to Node 2. If it's greater than 1.5, they go to Node 25.
  • Following the initial split on DEBTINC, DELINQ (Number of Delinquent Credit Lines) and CLAGE (Age of the Oldest Credit Line) appear as important features in the subsequent splits in the left branch (for lower DEBTINC values).
  • In the right branch (for higher DEBTINC values), DEBTINC appears again for a further split, highlighting its continued importance. DELINQ and CLAGE also appear in subsequent splits in this branch.
In [99]:
# text representation of decision tree
print(tree.export_text(dt_tuned, feature_names=X_train.columns.tolist(), show_weights=True))
|--- DEBTINC <= 34.79
|   |--- DELINQ <= 1.50
|   |   |--- CLAGE <= 64.72
|   |   |   |--- YOJ <= 11.50
|   |   |   |   |--- weights: [4.20, 9.60] class: 1
|   |   |   |--- YOJ >  11.50
|   |   |   |   |--- weights: [2.60, 0.00] class: 0
|   |   |--- CLAGE >  64.72
|   |   |   |--- YOJ <= 5.50
|   |   |   |   |--- MORTDUE <= 37346.50
|   |   |   |   |   |--- LOAN <= 12750.00
|   |   |   |   |   |   |--- weights: [1.40, 8.00] class: 1
|   |   |   |   |   |--- LOAN >  12750.00
|   |   |   |   |   |   |--- weights: [4.80, 2.40] class: 0
|   |   |   |   |--- MORTDUE >  37346.50
|   |   |   |   |   |--- VALUE <= 121967.00
|   |   |   |   |   |   |--- LOAN <= 26000.00
|   |   |   |   |   |   |   |--- weights: [73.00, 5.60] class: 0
|   |   |   |   |   |   |--- LOAN >  26000.00
|   |   |   |   |   |   |   |--- weights: [3.00, 3.20] class: 1
|   |   |   |   |   |--- VALUE >  121967.00
|   |   |   |   |   |   |--- CLNO <= 14.50
|   |   |   |   |   |   |   |--- weights: [1.20, 4.00] class: 1
|   |   |   |   |   |   |--- CLNO >  14.50
|   |   |   |   |   |   |   |--- weights: [14.80, 6.40] class: 0
|   |   |   |--- YOJ >  5.50
|   |   |   |   |--- DEBTINC <= 34.62
|   |   |   |   |   |--- LOAN <= 6050.00
|   |   |   |   |   |   |--- weights: [3.80, 2.40] class: 0
|   |   |   |   |   |--- LOAN >  6050.00
|   |   |   |   |   |   |--- DEROG <= 0.50
|   |   |   |   |   |   |   |--- weights: [173.20, 6.40] class: 0
|   |   |   |   |   |   |--- DEROG >  0.50
|   |   |   |   |   |   |   |--- weights: [12.20, 2.40] class: 0
|   |   |   |   |--- DEBTINC >  34.62
|   |   |   |   |   |--- weights: [2.20, 2.40] class: 1
|   |--- DELINQ >  1.50
|   |   |--- DEBTINC <= 21.46
|   |   |   |--- weights: [0.60, 7.20] class: 1
|   |   |--- DEBTINC >  21.46
|   |   |   |--- VALUE <= 75367.00
|   |   |   |   |--- weights: [4.20, 0.00] class: 0
|   |   |   |--- VALUE >  75367.00
|   |   |   |   |--- YOJ <= 4.50
|   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |--- YOJ >  4.50
|   |   |   |   |   |--- MORTDUE <= 126599.50
|   |   |   |   |   |   |--- weights: [3.20, 10.40] class: 1
|   |   |   |   |   |--- MORTDUE >  126599.50
|   |   |   |   |   |   |--- weights: [2.20, 0.00] class: 0
|--- DEBTINC >  34.79
|   |--- DEBTINC <= 34.82
|   |   |--- DELINQ <= 0.50
|   |   |   |--- CLAGE <= 178.23
|   |   |   |   |--- DEROG <= 0.50
|   |   |   |   |   |--- CLNO <= 20.50
|   |   |   |   |   |   |--- CLNO <= 3.50
|   |   |   |   |   |   |   |--- weights: [0.40, 16.80] class: 1
|   |   |   |   |   |   |--- CLNO >  3.50
|   |   |   |   |   |   |   |--- weights: [14.20, 94.40] class: 1
|   |   |   |   |   |--- CLNO >  20.50
|   |   |   |   |   |   |--- JOB_Office <= 0.50
|   |   |   |   |   |   |   |--- weights: [7.80, 29.60] class: 1
|   |   |   |   |   |   |--- JOB_Office >  0.50
|   |   |   |   |   |   |   |--- weights: [2.40, 0.80] class: 0
|   |   |   |   |--- DEROG >  0.50
|   |   |   |   |   |--- YOJ <= 8.95
|   |   |   |   |   |   |--- CLNO <= 21.50
|   |   |   |   |   |   |   |--- weights: [1.00, 22.40] class: 1
|   |   |   |   |   |   |--- CLNO >  21.50
|   |   |   |   |   |   |   |--- weights: [0.00, 12.00] class: 1
|   |   |   |   |   |--- YOJ >  8.95
|   |   |   |   |   |   |--- weights: [1.20, 12.00] class: 1
|   |   |   |--- CLAGE >  178.23
|   |   |   |   |--- YOJ <= 3.25
|   |   |   |   |   |--- MORTDUE <= 39341.50
|   |   |   |   |   |   |--- weights: [0.60, 5.60] class: 1
|   |   |   |   |   |--- MORTDUE >  39341.50
|   |   |   |   |   |   |--- weights: [5.00, 10.40] class: 1
|   |   |   |   |--- YOJ >  3.25
|   |   |   |   |   |--- CLNO <= 23.50
|   |   |   |   |   |   |--- LOAN <= 7250.00
|   |   |   |   |   |   |   |--- weights: [1.40, 2.40] class: 1
|   |   |   |   |   |   |--- LOAN >  7250.00
|   |   |   |   |   |   |   |--- weights: [12.40, 4.00] class: 0
|   |   |   |   |   |--- CLNO >  23.50
|   |   |   |   |   |   |--- YOJ <= 16.50
|   |   |   |   |   |   |   |--- weights: [6.60, 7.20] class: 1
|   |   |   |   |   |   |--- YOJ >  16.50
|   |   |   |   |   |   |   |--- weights: [1.60, 8.00] class: 1
|   |   |--- DELINQ >  0.50
|   |   |   |--- CLAGE <= 333.28
|   |   |   |   |--- DELINQ <= 2.50
|   |   |   |   |   |--- YOJ <= 23.50
|   |   |   |   |   |   |--- CLAGE <= 92.65
|   |   |   |   |   |   |   |--- weights: [0.40, 29.60] class: 1
|   |   |   |   |   |   |--- CLAGE >  92.65
|   |   |   |   |   |   |   |--- weights: [7.40, 96.00] class: 1
|   |   |   |   |   |--- YOJ >  23.50
|   |   |   |   |   |   |--- weights: [1.20, 4.00] class: 1
|   |   |   |   |--- DELINQ >  2.50
|   |   |   |   |   |--- LOAN <= 15250.00
|   |   |   |   |   |   |--- MORTDUE <= 45512.50
|   |   |   |   |   |   |   |--- weights: [0.00, 14.40] class: 1
|   |   |   |   |   |   |--- MORTDUE >  45512.50
|   |   |   |   |   |   |   |--- weights: [1.60, 28.00] class: 1
|   |   |   |   |   |--- LOAN >  15250.00
|   |   |   |   |   |   |--- weights: [0.00, 39.20] class: 1
|   |   |   |--- CLAGE >  333.28
|   |   |   |   |--- weights: [1.40, 4.00] class: 1
|   |--- DEBTINC >  34.82
|   |   |--- DEBTINC <= 43.68
|   |   |   |--- DELINQ <= 3.50
|   |   |   |   |--- CLAGE <= 178.67
|   |   |   |   |   |--- DEROG <= 0.50
|   |   |   |   |   |   |--- LOAN <= 19550.00
|   |   |   |   |   |   |   |--- weights: [81.60, 44.00] class: 0
|   |   |   |   |   |   |--- LOAN >  19550.00
|   |   |   |   |   |   |   |--- weights: [45.00, 7.20] class: 0
|   |   |   |   |   |--- DEROG >  0.50
|   |   |   |   |   |   |--- DEROG <= 1.50
|   |   |   |   |   |   |   |--- weights: [6.20, 9.60] class: 1
|   |   |   |   |   |   |--- DEROG >  1.50
|   |   |   |   |   |   |   |--- weights: [1.00, 8.80] class: 1
|   |   |   |   |--- CLAGE >  178.67
|   |   |   |   |   |--- DELINQ <= 1.50
|   |   |   |   |   |   |--- CLNO <= 8.50
|   |   |   |   |   |   |   |--- weights: [4.20, 4.00] class: 0
|   |   |   |   |   |   |--- CLNO >  8.50
|   |   |   |   |   |   |   |--- weights: [139.20, 10.40] class: 0
|   |   |   |   |   |--- DELINQ >  1.50
|   |   |   |   |   |   |--- CLAGE <= 195.95
|   |   |   |   |   |   |   |--- weights: [3.20, 0.00] class: 0
|   |   |   |   |   |   |--- CLAGE >  195.95
|   |   |   |   |   |   |   |--- weights: [4.40, 5.60] class: 1
|   |   |   |--- DELINQ >  3.50
|   |   |   |   |--- weights: [2.00, 11.20] class: 1
|   |   |--- DEBTINC >  43.68
|   |   |   |--- CLAGE <= 231.67
|   |   |   |   |--- DEBTINC <= 44.38
|   |   |   |   |   |--- weights: [1.20, 7.20] class: 1
|   |   |   |   |--- DEBTINC >  44.38
|   |   |   |   |   |--- LOAN <= 11050.00
|   |   |   |   |   |   |--- weights: [0.00, 8.80] class: 1
|   |   |   |   |   |--- LOAN >  11050.00
|   |   |   |   |   |   |--- weights: [0.00, 34.40] class: 1
|   |   |   |--- CLAGE >  231.67
|   |   |   |   |--- weights: [3.80, 3.20] class: 0

Key points on most impactful features and how they drive the primary decisions in the tree:

  1. Primary Split (Most Important Feature): The tree first splits based on DEBTINC (Debt-to-Income Ratio).
  • Applicants with a lower DEBTINC (<= 34.79) are initially considered lower risk and go down the left branch.
  • Applicants with a higher DEBTINC (> 34.79) are initially considered higher risk and go down the right branch.
  2. Secondary Splits (Within Main Branches):
  • In the lower DEBTINC branch, the next crucial factor is DELINQ (Number of Delinquent Credit Lines). Fewer delinquencies keep the risk lower, while more delinquencies increase the risk significantly, even with a lower DEBTINC. CLAGE (Age of the Oldest Credit Line) also plays a role in further splits in this branch.
  • In the higher DEBTINC branch, DEBTINC is revisited for further splits, confirming its strong influence. DELINQ and CLAGE are also important in refining the risk assessment within this branch.
  3. Overall Risk Indicators: The tree structure clearly highlights that:
  • Higher DEBTINC is a strong indicator of higher default risk.
  • Higher DELINQ is a significant indicator of higher default risk, regardless of the initial DEBTINC.
  • Lower CLAGE (younger credit history) tends to be associated with higher risk.

The tree then uses other features like YOJ, MORTDUE, VALUE, LOAN, DEROG, and CLNO in deeper, more specific splits to fine-tune the risk prediction for individual applicants.

In essence, the model prioritizes DEBTINC, DELINQ, and CLAGE to determine the likelihood of loan default, with higher values in the first two and lower values in the last indicating increased risk.
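One way to make this concrete is to trace a single applicant through a tree with decision_path. A self-contained sketch on a toy tree with two hypothetical features standing in for DEBTINC and DELINQ (not the fitted dt_tuned):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in: labels follow the rule "high DEBTINC or many delinquencies"
rng = np.random.RandomState(42)
X = np.c_[rng.uniform(10, 50, 300), rng.poisson(0.5, 300)]
y = ((X[:, 0] > 34.8) | (X[:, 1] > 1.5)).astype(int)

clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# decision_path returns the nodes a sample visits from root to leaf
applicant = np.array([[40.0, 0.0]])  # hypothetical high-DEBTINC applicant
node_ids = clf.decision_path(applicant).indices

for node in node_ids:
    if clf.tree_.children_left[node] == -1:  # leaf node
        print(f"leaf {node}: predicted class {clf.predict(applicant)[0]}")
    else:
        feat = ['DEBTINC', 'DELINQ'][clf.tree_.feature[node]]
        print(f"node {node}: {feat} <= {clf.tree_.threshold[node]:.2f}?")
```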

In [100]:
# Checking tuned model feature importance

importances = dt_tuned.feature_importances_
columns = X.columns

# Sort feature importances in ascending order (barh plots bottom-to-top, so the largest bar appears at the top)
indices = np.argsort(importances)

# Rearrange feature names so they match the sorted feature importances
columns = columns[indices]

# Create plot
plt.figure(figsize = (10, 10))

# Create plot title
plt.title("Feature Importance")

# Add bars with different colors
colors = plt.cm.viridis(np.linspace(0, 1, len(importances)))
plt.barh(range(X.shape[1]), importances[indices], color=colors)

# Add feature names as y-axis labels
plt.yticks(range(X.shape[1]), columns)

# Show plot
plt.show()
(feature importance bar chart)

Observations:

  • DEBTINC (Debt-to-Income Ratio) is by far the most important feature, with the longest bar. This strongly aligns with the Decision Tree plot where DEBTINC was used for the initial split and subsequent splits in the right branch, indicating its dominant role in predicting loan default.

  • CLAGE (Age of the Oldest Credit Line) and DELINQ (Number of Delinquent Credit Lines) are the next most important features, although their importance is significantly less than DEBTINC. This also aligns with their appearance in the Decision Tree splits.

  • YOJ (Years at Present Job) and DEROG (Number of Major Derogatory Reports) are the next in line, showing some influence on the model's predictions.

  • Features like LOAN, MORTDUE, CLNO, and VALUE have lower importance according to this model.

  • The JOB categories (JOB_Office, JOB_Other, JOB_Self, JOB_Sales, JOB_ProfExe) and REASON have the least importance in this tuned Decision Tree model.

  • The feature importance plot reinforces that DEBTINC, CLAGE, and DELINQ are the most influential factors in the tuned Decision Tree model's prediction of default. This information is valuable for understanding which aspects of an applicant's financial profile are the most critical for assessing risk according to this model.
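A ranked text listing can complement the bar chart. A minimal sketch on toy data (the column names are reused for illustration; this is not the fitted dt_tuned):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)
cols = ['DEBTINC', 'CLAGE', 'DELINQ', 'YOJ', 'DEROG']
X = pd.DataFrame(rng.rand(400, 5), columns=cols)
y = (X['DEBTINC'] + 0.3 * X['DELINQ'] > 0.7).astype(int)  # DEBTINC dominates

clf = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X, y)

# Rank features by impurity-based importance, largest first
ranking = pd.Series(clf.feature_importances_, index=cols).sort_values(ascending=False)
print(ranking.round(3))
```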

Random Forest¶

Building a Random Forest Classifier¶

Random Forest is a bagging algorithm where the base models are Decision Trees. Bootstrap samples are drawn from the training data, and a separate decision tree is trained on each sample and makes its own prediction.

The results from all the decision trees are combined together and the final prediction is made using voting or averaging.
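The voting step can be made concrete by querying the fitted base trees through estimators_. A toy sketch on synthetic data (note that scikit-learn's RandomForestClassifier actually averages the trees' predicted probabilities rather than counting hard votes, so the two can differ in near-ties):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Imbalanced toy problem, roughly mirroring the 80/20 class split
X, y = make_classification(n_samples=300, n_features=6, weights=[0.8, 0.2],
                           random_state=42)
rf = RandomForestClassifier(n_estimators=25, random_state=42).fit(X, y)

# Each fitted base tree makes its own prediction ("vote") for a sample
sample = X[:1]
votes = np.array([t.predict(sample)[0] for t in rf.estimators_])

print(int(votes.sum()), "of", len(votes), "trees voted class 1")
print("forest prediction:", int(rf.predict(sample)[0]))
```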

In [101]:
# Random Forest Classifier
rf = RandomForestClassifier(class_weight={0:0.2, 1:0.8}, random_state=42)

# fit train data to model
rf.fit(X_train, y_train)
Out[101]:
RandomForestClassifier(class_weight={0: 0.2, 1: 0.8}, random_state=42)
In [102]:
# Checking performance on the train data
y_train_pred = rf.predict(X_train)

metric_score(y_train, y_train_pred)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3340
           1       1.00      1.00      1.00       832

    accuracy                           1.00      4172
   macro avg       1.00      1.00      1.00      4172
weighted avg       1.00      1.00      1.00      4172

(confusion matrix plot)

True Negatives (Top-Left): 3340 loans were correctly predicted as non-defaulted.

False Positives (Top-Right): 0 non-defaulted loans were incorrectly predicted as defaulted.

False Negatives (Bottom-Left): 0 defaulted loans were incorrectly predicted as non-defaulted.

True Positives (Bottom-Right): 832 defaulted loans were correctly predicted as defaulted.

Observations:

  • Similar to the initial Decision Tree model, this perfect performance on the training data is a strong indication of overfitting. The model has likely learned the training data too well, including any noise or specific patterns that are not representative of unseen data.
In [103]:
# Checking performance on the test data
y_test_pred = rf.predict(X_test)

rf_test = metric_score(y_test, y_test_pred)
              precision    recall  f1-score   support

           0       0.91      0.97      0.94      1431
           1       0.83      0.62      0.71       357

    accuracy                           0.90      1788
   macro avg       0.87      0.79      0.82      1788
weighted avg       0.89      0.90      0.89      1788

(confusion matrix plot)

True Negatives (Top-Left): 1385 loans were correctly predicted as non-defaulted.

False Positives (Top-Right): 46 non-defaulted loans were incorrectly predicted as defaulted (Type I error). This is a relatively low number of false alarms.

False Negatives (Bottom-Left): 136 defaulted loans were incorrectly predicted as non-defaulted (Type II error). This means the model missed 136 actual defaulters.

True Positives (Bottom-Right): 221 defaulted loans were correctly predicted as defaulted.

Observations:

  • The Random Forest model strikes a good balance between catching defaulters and making correct default predictions, with a precision of 0.83 and a recall of 0.62 for the defaulted class, resulting in an F1-score of 0.71.

  • Compared to the tuned Decision Tree model (Recall: 0.76, Precision: 0.58, F1: 0.66), the Random Forest has lower recall for the defaulted class but significantly higher precision. This means the Random Forest is less likely to flag a non-defaulter as a defaulter (fewer false positives) but misses more actual defaulters (more false negatives).

  • Compared to the tuned Logistic Regression model (Recall: 0.79, Precision: 0.33, F1: 0.46), the Random Forest has lower recall but much higher precision and a significantly better F1-score for the defaulted class.

  • We will see if hyperparameter tuning can improve the recall score for the defaulted class.

Random Forest Classifier Hyperparameter Tuning¶

In [104]:
# Create Random Forest Classifier to be used for tuning
rf_tuned = RandomForestClassifier(class_weight={0:0.2, 1:0.8}, criterion = 'gini', random_state=42) # same class_weight and criterion as decision tree

# Create parameter grid dictionary for different hyperparameter values to try
rf_parameters = {
    'n_estimators': [100, 200], # number of trees in the forest
    'max_depth': np.arange(5, 10), # controls max number of levels in each decision tree
    'min_samples_leaf': [10, 20], # minimum number of samples to make a leaf node
    'max_features': ['sqrt', 'log2'], # number of features to consider when looking for the best split

}

# Score(recall_score) used to compare parameter combinations
scorer = metrics.make_scorer(recall_score, pos_label = 1)

# Run the grid search
grid_search = GridSearchCV(rf_tuned, rf_parameters, scoring = scorer, cv = 5)

grid_search = grid_search.fit(X_train, y_train)

# Set the classifier to the best combination of parameters
rf_tuned = grid_search.best_estimator_

# Fit the best algorithm to the data
rf_tuned.fit(X_train, y_train)

# Display the best hyperparameters found by GridSearchCV
print("Best hyperparameters found by GridSearchCV:")
print(grid_search.best_params_)
Best hyperparameters found by GridSearchCV:
{'max_depth': np.int64(7), 'max_features': 'sqrt', 'min_samples_leaf': 10, 'n_estimators': 200}
In [105]:
# Checking tuned model performance on the train data
y_train_pred = rf_tuned.predict(X_train)

metric_score(y_train, y_train_pred)
              precision    recall  f1-score   support

           0       0.97      0.91      0.93      3340
           1       0.70      0.87      0.77       832

    accuracy                           0.90      4172
   macro avg       0.83      0.89      0.85      4172
weighted avg       0.91      0.90      0.90      4172

(confusion matrix plot)

True Negatives (Top-Left): 3024 loans were correctly predicted as non-defaulted.

False Positives (Top-Right): 316 non-defaulted loans were incorrectly predicted as defaulted.

False Negatives (Bottom-Left): 109 defaulted loans were incorrectly predicted as non-defaulted.

True Positives (Bottom-Right): 723 defaulted loans were correctly predicted as defaulted.

Observations:

  • The tuned Random Forest model no longer scores perfectly on the training data, unlike the untuned version. That's actually a good thing: it means the tuning kept the model from simply memorizing the training examples (overfitting).

  • On the training data, the tuned model is still really good at finding defaulters (high recall) and is also better at being correct when it predicts a default (improved precision).

  • Essentially, tuning made the model smarter and less likely to be fooled by the training data, which should help it perform better on new loan applications.

In [106]:
# Check tuned model performance on the test data
y_test_pred = rf_tuned.predict(X_test)

rf_tuned_test = metric_score(y_test, y_test_pred)
              precision    recall  f1-score   support

           0       0.94      0.89      0.91      1431
           1       0.63      0.76      0.69       357

    accuracy                           0.86      1788
   macro avg       0.79      0.83      0.80      1788
weighted avg       0.88      0.86      0.87      1788

(confusion matrix plot)

True Negatives (Top-Left): 1273 loans were correctly predicted as non-defaulted.

False Positives (Top-Right): 158 non-defaulted loans were incorrectly predicted as defaulted (Type I error).

False Negatives (Bottom-Left): 84 defaulted loans were incorrectly predicted as non-defaulted (Type II error).

True Positives (Bottom-Right): 273 defaulted loans were correctly predicted as defaulted.

Observations:

  • The tuned Random Forest model performed well on the test data, demonstrating good generalization ability after addressing overfitting during tuning. For the crucial task of identifying defaulters (class 1), the tuned Random Forest achieves a Recall of 0.76 and a Precision of 0.63, resulting in an F1-score of 0.69.

  • Compared to the tuned Decision Tree model (Recall: 0.76, Precision: 0.58, F1: 0.66), the tuned Random Forest has the same Recall for the defaulted class but higher Precision and a slightly better F1-score. This means the tuned Random Forest is just as good at catching defaulters as the tuned Decision Tree while making fewer false positive predictions.

  • Compared to the tuned Logistic Regression model (Recall: 0.79, Precision: 0.33, F1: 0.46), the tuned Random Forest has slightly lower Recall for the defaulted class but significantly higher Precision and a much better F1-score. While the tuned Logistic Regression catches a few more defaulters, it also has a very high rate of false alarms.

  • Considering the importance of balancing minimizing false negatives (catching defaulters) and minimizing false positives (avoiding rejecting good applicants), the tuned Random Forest model appears to be the best performing model among the three evaluated. It offers a strong Recall for the defaulted class while maintaining a reasonably high Precision, resulting in the highest F1-score for the defaulted class on the unseen test data.

In [107]:
# Checking tuned model feature importance

importances = rf_tuned.feature_importances_
columns = X.columns

# Sort feature importances in ascending order (np.argsort sorts ascending,
# which places the largest bar at the top of the horizontal plot)
indices = np.argsort(importances)

# Rearrange feature names so they match the sorted feature importances
columns = columns[indices]

# Create plot
plt.figure(figsize = (10, 10))

# Create plot title
plt.title("Feature Importance")

# Add bars with different colors
colors = plt.cm.viridis(np.linspace(0, 1, len(importances)))
plt.barh(range(X.shape[1]), importances[indices], color=colors)

# Add feature names as y-axis labels
plt.yticks(range(X.shape[1]), columns)

# Show plot
plt.show()
[Figure: feature importance plot for the tuned Random Forest]

Here are the key observations from the plot:

  • DEBTINC (Debt-to-Income Ratio) is, once again, by far the most important feature. This reinforces its dominant role in predicting loan default across different models we've built.

  • DELINQ (Number of Delinquent Credit Lines) and CLAGE (Age of the Oldest Credit Line) are the next most important features, similar to what we observed in the Decision Tree analysis.

  • DEROG (Number of Major Derogatory Reports) follows, also indicating its significance in assessing default risk.

  • Features related to the loan and property values, such as LOAN, VALUE, and MORTDUE, have moderate importance.

  • CLNO (Number of Existing Credit Lines) and YOJ (Years at Present Job) have relatively lower importance compared to the top features.

  • NINQ (Number of Recent Credit Inquiries) and the one-hot encoded JOB categories (JOB_Office, JOB_Reason, JOB_Sales, JOB_Other, JOB_ProfExe, JOB_Self) have the lowest importance in this tuned Random Forest model.

  • The feature importance ranking from the tuned Random Forest is quite similar to the tuned Decision Tree. Both models highlight DEBTINC, DELINQ, and CLAGE as the most influential features. The Random Forest distributes the importance a bit more across a wider range of features compared to the Decision Tree, which had a very steep drop-off after the top few features. This is typical of ensemble models like Random Forest, as they combine the insights from multiple trees.

  • In conclusion, the feature importance plot from the tuned Random Forest model confirms the critical role of debt burden (DEBTINC), credit history (DELINQ, CLAGE, DEROG), and to a lesser extent, loan and property characteristics (LOAN, VALUE, MORTDUE) in predicting loan default.
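As an aside, the argsort/barh pattern used above can be condensed by wrapping the importances in a pandas Series. This is a minimal sketch on synthetic data (the names `X_demo`, `y_demo`, and the `f0`–`f4` labels are illustrative, not from the HMEQ dataset):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset standing in for the HMEQ features
X_demo, y_demo = make_classification(n_samples=300, n_features=5,
                                     random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_demo, y_demo)

# A sorted Series gives the ranking directly, largest importance first
imp = pd.Series(model.feature_importances_,
                index=[f"f{i}" for i in range(5)]).sort_values(ascending=False)
print(imp)
# imp.sort_values().plot.barh() would reproduce the horizontal bar chart
```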

SMOTE with standard Decision Tree and standard Random Forest¶

SMOTE (Synthetic Minority Over-sampling Technique) is a technique used to address class imbalance in datasets. In classification problems, when one class (the minority class) has significantly fewer instances than the other class (the majority class), models trained on such data tend to be biased towards the majority class. This can lead to poor performance in predicting the minority class, which is often the class of interest (like loan defaulters in this case).

SMOTE works by creating synthetic instances of the minority class. It doesn't just duplicate existing minority class samples; instead, it generates new examples that are combinations of existing minority samples and their nearest neighbors.

Here's how SMOTE typically uses k-nearest neighbors:

  1. Select a Minority Class Instance: SMOTE starts by selecting a random instance from the minority class.
  2. Find K-Nearest Neighbors: It then finds its 'k' nearest neighbors within the minority class (where 'k' is a parameter, often 5).
  3. Create Synthetic Samples: For each selected minority instance, SMOTE randomly chooses one or more of its k-nearest neighbors.
  4. Generate Synthetic Data Point: It then creates a synthetic data point by taking the difference between the selected instance and the chosen neighbor, multiplying this difference by a random number between 0 and 1, and adding this result to the selected instance. This effectively creates a new data point along the line segment between the two original minority instances.

This process is repeated until the minority class is balanced with the majority class, or the desired ratio is achieved. This helps to create a more diverse and representative set of minority class samples compared to simple oversampling by duplication.
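The interpolation in step 4 can be sketched in a few lines. This is a minimal NumPy illustration of the idea only, not the imblearn implementation; `smote_point` is a hypothetical helper name:

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_point(x, neighbor):
    """One synthetic sample on the line segment between x and a neighbor."""
    gap = rng.random()               # random multiplier in [0, 1)
    return x + gap * (neighbor - x)  # point between the two minority samples

x = np.array([2.0, 10.0])
nb = np.array([4.0, 14.0])
synthetic = smote_point(x, nb)
print(synthetic)  # each coordinate lies between x and nb
```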

Decision Tree with SMOTE¶

In [108]:
# assign data_tree to new variable dt
dt = data_tree
In [109]:
# check info
dt.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   BAD          5960 non-null   int64  
 1   LOAN         5960 non-null   int64  
 2   MORTDUE      5960 non-null   float64
 3   VALUE        5960 non-null   float64
 4   REASON       5960 non-null   int64  
 5   YOJ          5960 non-null   float64
 6   DEROG        5960 non-null   float64
 7   DELINQ       5960 non-null   float64
 8   CLAGE        5960 non-null   float64
 9   NINQ         5960 non-null   float64
 10  CLNO         5960 non-null   float64
 11  DEBTINC      5960 non-null   float64
 12  JOB_Office   5960 non-null   int64  
 13  JOB_Other    5960 non-null   int64  
 14  JOB_ProfExe  5960 non-null   int64  
 15  JOB_Sales    5960 non-null   int64  
 16  JOB_Self     5960 non-null   int64  
dtypes: float64(9), int64(8)
memory usage: 791.7 KB
  • This line simply assigns the data_tree DataFrame (which you had previously cleaned, imputed missing values, and encoded categorical variables) to a new variable named dt. Note that this creates a reference to the same DataFrame, not a copy; use data_tree.copy() if an independent copy is needed.
In [110]:
# Separating independent variables and target variable
X = dt.drop('BAD', axis=1)
y = dt['BAD']
In [111]:
# Apply SMOTE to oversample minority class in the training data
smote = SMOTE(random_state=42)
X_over, y_over = smote.fit_resample(X, y)
  • smote = SMOTE(random_state=42): This initializes a SMOTE object. random_state ensures reproducibility of the synthetic sample generation.
  • X_over, y_over = smote.fit_resample(X, y): This is the core of the SMOTE operation. fit_resample learns the patterns of the minority class from the original data (X, y) and generates synthetic samples for the minority class until it is balanced with the majority class. The resulting resampled datasets are stored in X_over (features) and y_over (target). Now, X_over and y_over contain the original data plus the newly generated synthetic samples for the minority class.
In [112]:
# Split into train and test datasets at a ratio of 70:30
X_over_train, X_over_test, y_over_train, y_over_test = train_test_split(X_over, y_over, test_size=0.3, random_state=42, stratify=y_over) # use stratify for class imbalance of target
  • stratify=y_over: This is important because even after SMOTE, you want to ensure that the proportion of the two classes (default/non-default) is maintained in both the training and testing sets of the oversampled data. This helps in getting a more reliable evaluation of the model's performance on both classes.
  • Train model on the oversampled data using X_over_train and y_over_train for fitting.
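The effect of stratify can be checked on a toy target (a small sklearn-only sketch; `y_demo` and `X_demo` are illustrative names):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 80% class 0, 20% class 1
y_demo = np.array([0] * 800 + [1] * 200)
X_demo = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

# Both splits preserve the ~20% minority share of the full data
print(y_tr.mean(), y_te.mean())
```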
In [113]:
# build decision tree classifier with smote
dt_smote = DecisionTreeClassifier(random_state=42)
In [114]:
# split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Train dt_smote
dt_smote.fit(X_over_train, y_over_train)

# predict dt_smote on train set
y_pred_train = dt_smote.predict(X_train)

# evaluate performance
metric_score(y_train, y_pred_train)
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      3340
           1       1.00      0.93      0.97       832

    accuracy                           0.99      4172
   macro avg       0.99      0.97      0.98      4172
weighted avg       0.99      0.99      0.99      4172

[Figure: confusion matrix for the Decision Tree with SMOTE on the original training data]
Out[114]:
              precision  recall  f1-score  support
0                  0.98    1.00      0.99  3340.00
1                  1.00    0.93      0.97   832.00
accuracy           0.99    0.99      0.99     0.99
macro avg          0.99    0.97      0.98  4172.00
weighted avg       0.99    0.99      0.99  4172.00

True Negatives (Top-Left): 3340 - All actual non-defaulters in the original training data were correctly predicted as non-defaulted.

False Positives (Top-Right): 0 - No non-defaulters in the original training data were incorrectly predicted as defaulted. This corresponds to perfect recall for the non-default class and perfect precision for the default class.

False Negatives (Bottom-Left): 55 - 55 actual defaulters in the original training data were incorrectly predicted as non-defaulted.

True Positives (Bottom-Right): 777 - 777 actual defaulters in the original training data were correctly predicted as defaulted.

  • Train the model on the oversampled data (X_over_train, y_over_train) and evaluate it here on the original training split (X_train, y_train); evaluation on the held-out test set (X_test, y_test) follows in the next cell.

  • dt_smote.fit(X_over_train, y_over_train): This line trains (fits) the dt_smote Decision Tree model using the oversampled training data. The model learns the patterns from this balanced dataset.

  • y_pred_train = dt_smote.predict(X_train): This line uses the trained model to make predictions on the original training data (X_train from the split of original X and y). This is to check how well the model, trained on oversampled data, performs on the imbalanced training data.

Observations:

  • High Performance on Original Training Data: The model shows very high performance metrics when evaluated on the original training data. This indicates that training on the balanced data using SMOTE has enabled the model to perform very well even on the imbalanced distribution of the training set.

  • Strong Recall for Default: The high recall of 0.93 for the defaulted class on the original training data is a positive sign that the model is learning to identify the minority class effectively.

  • Perfect Precision for Default: The precision of 1.00 for the defaulted class on the original training data is particularly noteworthy, indicating that every time the model predicted a default in this dataset, it was correct.
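As a sanity check, the class-1 metrics in the report can be reproduced directly from the confusion-matrix counts listed above:

```python
# Confusion-matrix counts from the SMOTE-trained Decision Tree on the
# original training data: TN, FP, FN, TP
tn, fp, fn, tp = 3340, 0, 55, 777

precision = tp / (tp + fp)                          # 777 / 777
recall = tp / (tp + fn)                             # 777 / 832
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 1.0 0.93 0.97
```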

In [115]:
# predict dt_smote on test set
y_pred_test = dt_smote.predict(X_test)

#evaluate performance
dt_smote_test = metric_score(y_test, y_pred_test)
              precision    recall  f1-score   support

           0       0.98      0.89      0.93      1431
           1       0.67      0.91      0.77       357

    accuracy                           0.89      1788
   macro avg       0.82      0.90      0.85      1788
weighted avg       0.92      0.89      0.90      1788

[Figure: confusion matrix for the Decision Tree with SMOTE on the test data]

True Negatives (Top-Left): 1274 loans were correctly predicted as non-defaulted.

False Positives (Top-Right): 157 loans were incorrectly predicted as defaulted (Type I error).

False Negatives (Bottom-Left): 32 loans were incorrectly predicted as non-defaulted when they actually defaulted (Type II error).

True Positives (Bottom-Right): 325 loans were correctly predicted as defaulted.

  • y_pred_test = dt_smote.predict(X_test): This line uses the trained model (dt_smote) to make predictions on the original testing data (X_test from the split of original X and y).
  • Evaluate performance on the original testing data by comparing the actual target values (y_test) with the model's predictions (y_pred_test). This is a more realistic evaluation of how the model will perform on unseen, imbalanced data in a real-world scenario.

Observations:

  • Excellent Recall on Unseen Data: The high Recall of 0.91 for the defaulted class on the original test data is a significant finding. It means the model is very effective at identifying defaulters in a realistic scenario.

  • Good Precision: The Precision of 0.67 for the defaulted class is also good, especially when considering the balance with Recall. It indicates a reasonable rate of false positives.

  • Strong F1-score: The F1-score of 0.77 for the defaulted class is the highest achieved so far among the models evaluated on the original test data.

  • Effective Use of SMOTE: Comparing these results to the Decision Tree without SMOTE highlights how SMOTE significantly improved the model's ability to predict the minority class on imbalanced data.

In [116]:
# Visualize smote Decision Tree with max depth limited to 4
features = list(X.columns) #store independent variable features in list
plt.figure(figsize = (20, 20))
tree.plot_tree(dt_smote, feature_names = features, class_names = ['Not Defaulted', 'Defaulted'], max_depth = 4, filled = True, fontsize = 8, node_ids=True)
plt.show()
[Figure: Decision Tree with SMOTE, visualized to a depth of 4]

Observations:

  • Root Node Difference: The most significant difference is at the root node (the first split). The Tuned Decision Tree (without SMOTE) splits on DEBTINC, while the Decision Tree with SMOTE splits on DELINQ. This shows how balancing the data with SMOTE shifted the model's initial focus to DELINQ, a feature highly indicative of default.

  • Consistent Important Features: Despite the root node difference, both trees use similar key features like DEBTINC, CLAGE, and DEROG in subsequent splits, indicating their consistent importance in predicting default risk.

  • SMOTE's Impact: The change in the root node split and the overall better performance on the original test data for the SMOTE-trained tree suggest that balancing the data helped the model learn more effective decision boundaries for identifying defaulters in a real-world, imbalanced scenario.

  • In short, SMOTE changed which feature the Decision Tree prioritized first (from DEBTINC to DELINQ), leading to a model that performs better at identifying loan defaults on imbalanced data.

In [117]:
# Checking smote model feature importance

importances = dt_smote.feature_importances_
columns = X.columns

# Sort feature importances in ascending order (np.argsort sorts ascending,
# which places the largest bar at the top of the horizontal plot)
indices = np.argsort(importances)

# Rearrange feature names so they match the sorted feature importances
columns = columns[indices]

# Create plot
plt.figure(figsize = (10, 10))

# Create plot title
plt.title("Feature Importance")

# Add bars with different colors
colors = plt.cm.viridis(np.linspace(0, 1, len(importances)))
plt.barh(range(X.shape[1]), importances[indices], color=colors)

# Add feature names as y-axis labels
plt.yticks(range(X.shape[1]), columns)

# Show plot
plt.show()
[Figure: feature importance plot for the Decision Tree with SMOTE]

Observations:

Based on the plot:

  • DELINQ (Number of Delinquent Credit Lines) is the most important feature. This is indicated by the longest bar at the top. This confirms that the number of delinquent credit lines is a very strong predictor of loan default according to this model.

  • DEBTINC (Debt-to-Income Ratio) is the second most important feature. Its bar is also quite long, highlighting its significant influence on the model's predictions.

  • DEROG (Number of Major Derogatory Reports) is the third most important feature. This reinforces that negative credit history is a key factor in predicting default.

  • CLAGE (Age of the Oldest Credit Line) is the fourth most important feature. The older the oldest credit line, the less likely a default, as seen in earlier analysis.

  • Features like VALUE, LOAN, YOJ, NINQ, and CLNO have moderate importance.

  • The JOB categories and REASON have relatively low importance in this specific Decision Tree model trained with SMOTE.

Comparison to Tuned Decision Tree (without SMOTE):

  • Both models identify DEBTINC as a top important feature, although its relative dominance is less pronounced in this SMOTE-trained model where DELINQ takes the top spot.

  • DELINQ, DEROG, and CLAGE remain highly important in both models. The relative importance of other features might shift slightly, but the overall picture of credit history and debt burden features being the most influential remains consistent.

Conclusion:

  • The feature importance plot from the Decision Tree model trained with SMOTE confirms that DELINQ, DEBTINC, DEROG, and CLAGE are the most influential factors in predicting loan default according to this model. This aligns with the insights gained from the EDA and the performance metrics, highlighting the importance of these specific aspects of an applicant's financial profile for assessing default risk.

Random Forest with SMOTE¶

In [118]:
# assign data_tree to new variable rf
rf = data_tree
  • This line simply assigns the data_tree DataFrame (which you had previously cleaned, imputed missing values, and encoded categorical variables) to a new variable named rf. As with dt above, this creates a reference to the same DataFrame rather than a copy.
In [119]:
# Separating independent variables and target variable
X = rf.drop('BAD', axis=1)
y = rf['BAD']
In [120]:
# Apply SMOTE to oversample minority class in the training data
smote = SMOTE(random_state=42)
X_over, y_over = smote.fit_resample(X, y)
  • smote = SMOTE(random_state=42): This initializes a SMOTE object. random_state ensures reproducibility of the synthetic sample generation.
  • X_over, y_over = smote.fit_resample(X, y): This is the core of the SMOTE operation. fit_resample learns the patterns of the minority class from the original data (X, y) and generates synthetic samples for the minority class until it is balanced with the majority class. The resulting resampled datasets are stored in X_over (features) and y_over (target). Now, X_over and y_over contain the original data plus the newly generated synthetic samples for the minority class.
In [121]:
# Split into train and test datasets at a ratio of 70:30
X_over_train, X_over_test, y_over_train, y_over_test = train_test_split(X_over, y_over, test_size=0.3, random_state=42, stratify=y_over) # use stratify for class imbalance of target
  • stratify=y_over: This is important because even after SMOTE, you want to ensure that the proportion of the two classes (default/non-default) is maintained in both the training and testing sets of the oversampled data. This helps in getting a more reliable evaluation of the model's performance on both classes.
  • Train model on the oversampled data using X_over_train and y_over_train for fitting.
In [122]:
# build random forest classifier with smote
rf_smote = RandomForestClassifier(n_estimators = 200, random_state=42)
In [123]:
# split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Train rf_smote
rf_smote.fit(X_over_train, y_over_train)

# predict rf_smote on train set
y_pred_train = rf_smote.predict(X_train)

# evaluate performance
metric_score(y_train, y_pred_train)
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      3340
           1       1.00      0.96      0.98       832

    accuracy                           0.99      4172
   macro avg       1.00      0.98      0.99      4172
weighted avg       0.99      0.99      0.99      4172

[Figure: confusion matrix for the Random Forest with SMOTE on the original training data]
Out[123]:
              precision  recall  f1-score  support
0                  0.99    1.00      1.00  3340.00
1                  1.00    0.96      0.98   832.00
accuracy           0.99    0.99      0.99     0.99
macro avg          1.00    0.98      0.99  4172.00
weighted avg       0.99    0.99      0.99  4172.00

True Negatives (Top-Left): 3340 - All actual non-defaulters in the original training data were correctly predicted as non-defaulted.

False Positives (Top-Right): 0 - No non-defaulters in the original training data were incorrectly predicted as defaulted.

False Negatives (Bottom-Left): 33 - 33 actual defaulters in the original training data were incorrectly predicted as non-defaulted.

True Positives (Bottom-Right): 799 - 799 actual defaulters in the original training data were correctly predicted as defaulted.

  • X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y): Similar to the Decision Tree cell, this line splits the original data (X and y) into training and testing sets. This split is used for evaluating the model on the original training data in this cell.

  • rf_smote.fit(X_over_train, y_over_train): This is where the Random Forest model (rf_smote) is trained. It uses the oversampled training data (X_over_train, y_over_train) for fitting. The model learns from this balanced dataset.

  • y_pred_train = rf_smote.predict(X_train): After training on the oversampled data, the model makes predictions on the original training data (X_train). This shows how the model performs on the imbalanced training distribution.

In [124]:
# predict rf_smote on test set
y_pred_test = rf_smote.predict(X_test)

#evaluate performance
rf_smote_test = metric_score(y_test, y_pred_test)
              precision    recall  f1-score   support

           0       0.98      0.96      0.97      1431
           1       0.86      0.92      0.89       357

    accuracy                           0.95      1788
   macro avg       0.92      0.94      0.93      1788
weighted avg       0.95      0.95      0.95      1788

[Figure: confusion matrix for the Random Forest with SMOTE on the test data]

True Negatives (Top-Left): 1377 loans were correctly predicted as non-defaulted.

False Positives (Top-Right): 54 loans were incorrectly predicted as defaulted (Type I error). This is a low number of false alarms.

False Negatives (Bottom-Left): 30 loans were incorrectly predicted as non-defaulted when they actually defaulted (Type II error). This is a very low number of missed defaulters, crucial for minimizing losses.

True Positives (Bottom-Right): 327 loans were correctly predicted as defaulted.

  • y_pred_test = rf_smote.predict(X_test): This line uses the trained Random Forest model (rf_smote) to make predictions on the original test data (X_test).
  • Evaluate performance on the original testing data by comparing the actual target values (y_test) with the model's predictions (y_pred_test). This is a more realistic evaluation of how the model will perform on unseen, imbalanced data in a real-world scenario.

Observations:

  • Outstanding Performance on Unseen Data: The Random Forest model trained with SMOTE shows excellent performance on the original test data.

  • High Recall and Precision for Default: It achieves both a high Recall (0.92) and a very high Precision (0.86) for the defaulted class. This means it's both very good at identifying defaulters and very accurate when it does predict a default.

  • Exceptional F1-score: The F1-score of 0.89 for the defaulted class is the highest achieved among all the models evaluated on the original test data. This indicates the best overall balance between the critical metrics.

  • Effective Use of SMOTE: Comparing this to the untuned Random Forest model without SMOTE (Recall: 0.62, Precision: 0.83, F1: 0.71), the use of SMOTE significantly improved the Recall while maintaining a high Precision, leading to a much better F1-score.

  • These performance metrics suggest that the Random Forest model trained with SMOTE is the best performing model among those evaluated for this loan default prediction task. Its ability to achieve both high Recall and high Precision for the minority class on unseen, imbalanced data, resulting in the highest F1-score, makes it highly effective at minimizing false negatives while also maintaining a low rate of false positives.

  • Given its superior performance across the critical metrics, this model is the most recommended for the final solution design, balancing the business need to identify defaulters with the need to avoid incorrectly rejecting loan applicants.

In [125]:
# Checking smote model feature importance

importances = rf_smote.feature_importances_
columns = X.columns

# Sort feature importances in ascending order (np.argsort sorts ascending,
# which places the largest bar at the top of the horizontal plot)
indices = np.argsort(importances)

# Rearrange feature names so they match the sorted feature importances
columns = columns[indices]

# Create plot
plt.figure(figsize = (10, 10))

# Create plot title
plt.title("Feature Importance")

# Add bars with different colors
colors = plt.cm.viridis(np.linspace(0, 1, len(importances)))
plt.barh(range(X.shape[1]), importances[indices], color=colors)

# Add feature names as y-axis labels
plt.yticks(range(X.shape[1]), columns)

# Show plot
plt.show()
[Figure: feature importance plot for the Random Forest with SMOTE]

Observations:

Here are the key observations from the plot:

  • DELINQ (Number of Delinquent Credit Lines) is the most important feature. This is indicated by the longest bar at the top. This aligns with what we saw in the Decision Tree with SMOTE and reinforces that the number of delinquent credit lines is a very strong predictor of loan default according to this ensemble model as well.

  • DEBTINC (Debt-to-Income Ratio) is the second most important feature. Its bar is also quite long, highlighting its significant influence on the model's predictions.

  • DEROG (Number of Major Derogatory Reports) is the third most important feature. This reinforces that negative credit history is a key factor in predicting default.

  • NINQ (Number of Recent Credit Inquiries) is the fourth most important feature. This is interesting as its importance is higher in this Random Forest with SMOTE compared to the single Decision Tree with SMOTE. This suggests the ensemble nature of Random Forest might be leveraging this feature more effectively.

  • CLAGE (Age of the Oldest Credit Line) is the fifth most important feature.

  • YOJ (Years at Present Job) has relatively lower importance compared to the top features.

  • The JOB categories and REASON have the lowest importance in this Random Forest model trained with SMOTE.

Comparison to Decision Tree with SMOTE Feature Importance

  • Both models agree on the top three most important features: DELINQ, DEBTINC, and DEROG.
  • The Random Forest assigns a slightly higher importance to NINQ compared to the single Decision Tree.
  • The Random Forest tends to distribute importance a bit more evenly across a larger set of features compared to the single Decision Tree, where the importance drops off more steeply after the top few features. This is characteristic of ensemble methods.

Conclusion:

  • The feature importance plot from the Random Forest model trained with SMOTE confirms the critical role of DELINQ, DEBTINC, and DEROG in predicting loan default. It also highlights the increased importance of NINQ in this ensemble model. This information is valuable for understanding which aspects of an applicant's financial profile are most influential for the best-performing model.

Model Comparison¶

1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):¶

  • How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?
In [126]:
# Extract Class 1 metrics from each classification report DataFrame
lr_test_class1 = lr_test.loc[['1']]
lr_opt_test_class1 = lr_opt_test.loc[['1']]
dt_test_class1 = dt_test.loc[['1']]
dt_tuned_test_class1 = dt_tuned_test.loc[['1']]
rf_test_class1 = rf_test.loc[['1']]
rf_tuned_test_class1 = rf_tuned_test.loc[['1']]
dt_smote_test_class1 = dt_smote_test.loc[['1']]
rf_smote_test_class1 = rf_smote_test.loc[['1']]


# Compare Class 1 metrics for all models
models_test_comp_df = pd.concat(
    [
        lr_test_class1,
        lr_opt_test_class1,
        dt_test_class1,
        dt_tuned_test_class1,
        rf_test_class1,
        rf_tuned_test_class1,
        dt_smote_test_class1,
        rf_smote_test_class1,
    ],
    axis=0  # Concatenate along rows
)

# Set the index (row labels) to the model names
models_test_comp_df.index = [
    "Logistic Regression (Untuned)",
    "Logistic Regression (Tuned)",
    "Decision Tree (Untuned)",
    "Decision Tree (Tuned)",
    "Random Forest (Untuned)",
    "Random Forest (Tuned)",
    "Decision Tree with SMOTE",
    "Random Forest with SMOTE",
]

print("Test performance comparison (Class 1 metrics):")
display(models_test_comp_df)
Test performance comparison (Class 1 metrics):
                               precision  recall  f1-score  support
Logistic Regression (Untuned)       0.25    0.91      0.39   357.00
Logistic Regression (Tuned)         0.33    0.79      0.46   357.00
Decision Tree (Untuned)             0.69    0.62      0.65   357.00
Decision Tree (Tuned)               0.58    0.76      0.66   357.00
Random Forest (Untuned)             0.83    0.62      0.71   357.00
Random Forest (Tuned)               0.63    0.76      0.69   357.00
Decision Tree with SMOTE            0.67    0.91      0.77   357.00
Random Forest with SMOTE            0.86    0.92      0.89   357.00

The key measures of success are focused on effectively identifying loan defaulters while managing false positives. Therefore, the important metrics to consider are:

  • Recall (Class 1 - Default): Measures the proportion of actual defaulters correctly identified. High Recall is crucial to minimize missed defaulters (false negatives).
  • Precision (Class 1 - Default): Measures the proportion of predicted defaulters that are actually defaulters. Important to limit false positives (incorrectly flagging non-defaulters).
  • F1-score (Class 1 - Default): Provides a balanced measure of the model's performance on the defaulted class (harmonic mean of Precision and Recall). Higher F1 indicates a better balance between catching defaulters and avoiding false alarms.

Now, let's compare the performance of the different models on the test data (Class 1 metrics):

  • Untuned Models: Untuned Logistic Regression had high Recall but very low Precision/F1. Untuned Decision Tree offered a better F1 score with higher Precision but lower Recall. Untuned Random Forest had the highest Precision/F1 but low Recall.
  • Tuned Models: Hyperparameter tuning generally rebalanced Precision and Recall compared to the untuned versions, but none of these models achieved high Recall while maintaining good Precision. The untuned Random Forest actually had the highest F1 (0.71) among models without SMOTE, though at the cost of low Recall (0.62); the tuned Random Forest traded a little F1 (0.69) for substantially better Recall (0.76).
  • Models with SMOTE: SMOTE significantly improved performance on the defaulted class.
    • Decision Tree with SMOTE achieved a strong F1 (0.77) with high Recall (0.91) and good Precision (0.67).
    • Random Forest with SMOTE achieved outstanding performance with the highest F1 (0.89), very high Recall (0.92), and very high Precision (0.86).

Which one is performing relatively better and why?

Considering high Recall and good F1-score for Class 1:

  • The Random Forest with SMOTE is performing significantly better, achieving the highest F1-score (0.89) and very high Recall (0.92) and Precision (0.86). This model offers the best balance for effectively identifying defaulters while minimizing false positives on the imbalanced test data.
  • The Decision Tree with SMOTE is a strong alternative with a good F1 (0.77) and high Recall (0.91), especially if interpretability is prioritized, though its performance is notably lower than the Random Forest with SMOTE.

Is there scope to improve the performance further?

Yes, further improvements are possible through more extensive hyperparameter tuning, advanced feature engineering, exploring Gradient Boosting models (XGBoost, LightGBM), or ensembling techniques.
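As one possible next step, a scikit-learn GradientBoostingClassifier can be dropped into the same fit/predict/score loop. This is a sketch on synthetic data only; the real pipeline would reuse X_over/y_over, and XGBoost or LightGBM follow the same interface:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Toy stand-in for the HMEQ data (~80/20 class split)
X_demo, y_demo = make_classification(n_samples=1000, weights=[0.8, 0.2],
                                     random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

gb = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
f1 = f1_score(y_te, gb.predict(X_te))   # class-1 F1 to compare with RF + SMOTE
print(f1)
```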

Which variables are significant in predicting the target variable? Do these variables continue to be significant post-modelling?

EDA identified DEBTINC, DELINQ, DEROG, CLAGE, and NINQ as significant. Post-modeling feature importance analysis consistently confirmed DELINQ, DEBTINC, and DEROG as the most influential predictors across various models, with CLAGE and NINQ also remaining important. The significance of these variables was largely consistent before and after modeling.